Principles of Digital Communication and Coding
Andrew J. Viterbi
Jim K. Omura
PRINCIPLES OF
DIGITAL
COMMUNICATION
AND CODING
McGraw-Hill Series in Electrical Engineering
Consulting Editor
Stephen W. Director, Carnegie-Mellon University
Networks and Systems
Communications and Information Theory
Control Theory
Electronics and Electronic Circuits
Power and Energy
Electromagnetics
Computer Engineering and Switching Theory
Introductory and Survey
Radio, Television, Radar, and Antennas
Previous Consulting Editors
Ronald M. Bracewell, Colin Cherry, James F. Gibbons, Willis W. Harman,
Hubert Heffner, Edward W. Herold, John G. Linvill, Simon Ramo, Ronald A. Rohrer,
Anthony E. Siegman, Charles Susskind, Frederick E. Terman, John G. Truxal,
Ernst Weber, and John R. Whinnery
Communications and Information Theory
Consulting Editor
Stephen W. Director, Carnegie-Mellon University
Abramson: Information Theory and Coding
Angelakos and Everhart: Microwave Communications
Antoniou: Digital Filters: Analysis and Design
Bennett: Introduction to Signal Transmission
Berlekamp: Algebraic Coding Theory
Carlson: Communications Systems
Davenport: Probability and Random Processes: An Introduction for Applied Scientists and
Engineers
Davenport and Root: Introduction to Random Signals and Noise
Drake: Fundamentals of Applied Probability Theory
Gold and Rader: Digital Processing of Signals
Guiasu: Information Theory with New Applications
Hancock: An Introduction to Principles of Communication Theory
Melsa and Cohn: Decision and Estimation Theory
Papoulis: Probability, Random Variables, and Stochastic Processes
Papoulis: Signal Analysis
Schwartz: Information Transmission, Modulation, and Noise
Schwartz, Bennett, and Stein: Communication Systems and Techniques
Schwartz and Shaw: Signal Processing
Shooman: Probabilistic Reliability: An Engineering Approach
Taub and Schilling: Principles of Communication Systems
Viterbi: Principles of Coherent Communication
Viterbi and Omura: Principles of Digital Communication and Coding
PRINCIPLES OF
DIGITAL
COMMUNICATION
AND CODING
Andrew J. Viterbi
LINKABIT Corporation
Jim K. Omura
University of California,
Los Angeles
McGraw-Hill, Inc.
New York St. Louis San Francisco Auckland Bogota
Caracas Lisbon London Madrid Mexico City Milan
Montreal New Delhi San Juan Singapore
Sydney Tokyo Toronto
PRINCIPLES OF DIGITAL COMMUNICATION AND CODING
Copyright © 1979 by McGraw-Hill, Inc. All rights reserved.
Printed in the United States of America. No part of this publication
may be reproduced, stored in a retrieval system, or transmitted, in any
form or by any means, electronic, mechanical, photocopying,
recording, or otherwise, without the
prior written permission of the publisher.
9101112 KPKP 976543
This book was set in Times Roman.
The editors were Frank J. Cerra and J. W. Maisel;
the cover was designed by Albert M. Cetta;
the production supervisor was Charles Hess.
The drawings were done by Santype Ltd.
Kingsport Press was printer and binder.
Library of Congress Cataloging in Publication Data
Viterbi, Andrew J
Principles of digital communication and coding.
(McGraw-Hill electrical engineering series: Communi-
cations and information theory section)
Includes bibliographical references and index.
1. Digital communications. 2. Coding theory.
I. Omura, Jim K., joint author. II. Title.
III. Series.
TK5103.7.V57 621.38 78-13951
ISBN 0-07-067516-3
CONTENTS
Preface xi
Part One Fundamentals of Digital
Communication and Block Coding
Chapter 1 Digital Communication Systems:
Fundamental Concepts and Parameters 3
1.1 Sources, Entropy, and the Noiseless Coding Theorem 7
1.2. Mutual Information and Channel Capacity 19
1.3. The Converse to the Coding Theorem 28
1.4 Summary and Bibliographical Notes 34
Appendix 1A Convex Functions 35
Appendix 1B Jensen Inequality for Convex Functions 40
Problems 42
Chapter 2 Channel Models and Block Coding 47
2.1 Block-coded Digital Communication on the Additive
Gaussian Noise Channel 47
2.2. Minimum Error Probability and Maximum Likelihood
Decoder 54
2.3. Error Probability and a Simple Upper Bound 58
2.4 A Tighter Upper Bound on Error Probability 64
2.5 Equal Energy Orthogonal Signals on the AWGN Channel 65
2.6 Bandwidth Constraints, Intersymbol Interference, and
Tracking Uncertainty 69
2.7 Channel Input Constraints 76
2.8 Channel Output Quantization: Discrete Memoryless
Channels 78
2.9 Linear Codes 82
*2.10 Systematic Linear Codes and Optimum Decoding for the
BSC 89
*2.11 Examples of Linear Block Code Performance on the
AWGN Channel and Its Quantized Reductions 96
2.12 Other Memoryless Channels 102
2.13 Bibliographical Notes and References 116
Appendix 2A Gram-Schmidt Orthogonalization and Signal
Representation 117
Problems 119
Chapter 3. Block Code Ensemble Performance Analysis 128
3.1 Code Ensemble Average Error Probability: Upper Bound 128
3.2 The Channel Coding Theorem and Error Exponent
Properties for Memoryless Channels 133
3.3. Expurgated Ensemble Average Error Probability: Upper
Bound at Low Rates 143
3.4 Examples: Binary-Input, Output-Symmetric Channels, and
Very Noisy Channels 151
3.5 Chernoff Bounds and the Neyman-Pearson Lemma 158
3.6 Sphere-Packing Lower Bounds 164
*3.7 Zero Rate Lower Bounds 173
*3.8 Low Rate Lower Bounds 178
*3.9 Conjectures and Converses 184
*3.10 Ensemble Bounds for Linear Codes 189
3.11 Bibliographical Notes and References 194
Appendix 3A Useful Inequalities and the Proofs of Lemma 3.2.1
and Corollary 3.3.2 194
Appendix 3B Kuhn-Tucker Conditions and Proofs of Theorems
3.2.2 and 3.2.3 202
Appendix 3C Computational Algorithm for Capacity 207
Problems 212
Part Two Convolutional Coding and Digital
Communication
Chapter 4 Convolutional Codes 227
4.1 Introduction and Basic Structure 227
4.2 Maximum Likelihood Decoder for Convolutional Codes—
The Viterbi Algorithm 235
4.3 Distance Properties of Convolutional Codes for
Binary-Input Channels 239
4.4 Performance Bounds for Specific Convolutional Codes on
Binary-Input, Output-Symmetric Memoryless Channels 242
4.5 Special Cases and Examples 246
4.6 Structure of Rate 1/n Codes and Orthogonal Convolutional
Codes 253
* May be omitted without loss of continuity.
4.7 Path Memory Truncation, Metric Quantization, and Code
Synchronization in Viterbi Decoders 258
*4.8 Feedback Decoding 262
*4.9 Intersymbol Interference Channels
*4.10 Coding for Intersymbol Interference Channels 284
4.11 Bibliographical Notes and References 286
Problems 287
Chapter 5 Convolutional Code Ensemble Performance 301
5.1. The Channel Coding Theorem for Time-varying
Convolutional Codes 301
5.2 Examples: Convolutional Coding Exponents for Very Noisy
Channels 313
5.3. Expurgated Upper Bound for Binary-Input,
Output-Symmetric Channels 315
5.4 Lower Bound on Error Probability 318
*5.5 Critical Lengths of Error Events 322
5.6 Path Memory Truncation and Initial Synchronization
Errors 327
5.7 Error Bounds for Systematic Convolutional Codes 328
*5.8 Time-varying Convolutional Codes on Intersymbol
Interference Channels 331
5.9 Bibliographical Notes and References 341
Problems 342
Chapter 6 Sequential Decoding of Convolutional
Codes 349
6.1 Fundamentals and a Basic Stack Algorithm 349
6.2 Distribution of Computation: Upper Bound
6.3 Error Probability Upper Bound 361
6.4 Distribution of Computations: Lower Bound 365
6.5 The Fano Algorithm and Other Sequential Decoding
Algorithms 370
6.6 Complexity, Buffer Overflow, and Other System
Considerations 374
6.7. Bibliographical Notes and References 378
Problems 379
Part Three Source Coding for Digital
Communication
Chapter 7 Rate Distortion Theory: Fundamental
Concepts for Memoryless Sources 385
7.1. The Source Coding Problem 385
7.2 Discrete Memoryless Sources—Block Codes 388
7.3 Relationships with Channel Coding 404
7.4 Discrete Memoryless Sources—Trellis Codes 411
7.5 Continuous Amplitude Memoryless Sources 423
*7.6 Evaluation of R(D)—Discrete Memoryless Sources 431
*7.7 Evaluation of R(D)—Continuous Amplitude Memoryless
Sources 445
7.8 Bibliographical Notes and References 453
Appendix 7A Computational Algorithm for R(D) 454
Problems 459
Chapter 8 Rate Distortion Theory: Memory, Gaussian
Sources, and Universal Coding 468
8.1 Memoryless Vector Sources 468
8.2 Sources with Memory 479
8.3 Bounds for R(D) 494
8.4 Gaussian Sources with Squared-Error Distortion 498
8.5 Symmetric Sources with Balanced Distortion Measures and
Fixed Composition Sequences 513
8.6 Universal Coding 526
8.7 Bibliographical Notes and References 534
Appendix 8A Chernoff Bounds for Distortion Distributions 534
Problems 541
Bibliography 547
Index 553
PREFACE
Digital communication is a much used term with many shades of meaning,
widely varying and strongly dependent on the user’s role and requirements.
This book is directed to the communication theory student and to the designer
of the channel, link, terminal, modem, or network used to transmit and receive
digital messages. Within this community, digital communication theory has come
to signify the body of knowledge and techniques which deal with the two-faceted
problem of (1) minimizing the number of bits which must be transmitted over
the communication channel so as to provide a given printed, audio, or visual
record within a predetermined fidelity requirement (called source coding); and
(2) ensuring that bits transmitted over the channel are received correctly despite
the effects of interference of various types and origins (called channel coding).
The foundations of the theory which provides the solution to this twofold problem
were laid by Claude Shannon in one remarkable series of papers in 1948. In the
intervening decades, the evolution and application of this so-called information
theory have had ever-expanding influence on the practical implementation of
digital communication systems, although their widespread application has
required the evolution of electronic-device and system technology to a point
which was hardly conceivable in 1948. This progress was accelerated by the
development of the large-scale integrated-circuit building block and the
economic incentive of communication satellite applications.
We have not attempted in this book to cover peripheral topics related to
digital communication theory when they involve a major deviation from the
basic concepts and techniques which lead to the solution of this fundamental
problem. For this reason, constructive algebraic techniques, though valuable for
developing code structures and important theoretical results of broad interest, are
specifically avoided in this book. Similarly, the peripheral, though practically
important, problems of carrier phase and frequency tracking, and time synchroni-
zation are not treated here. These have been covered adequately elsewhere. On
the other hand, the equally practical subject of intersymbol interference in
digital communication, which is fundamentally similar to the problem of con-
volutional coding, is covered and new insights are developed through connections
with the mainstream topics of the text.
This book was developed over approximately a dozen years of teaching a
sequence of graduate courses at the University of California, Los Angeles, and later
at the University of California, San Diego, with partial notes being distributed
over the past few years. Our goal in the resulting manuscript has been to provide
the most direct routes to achieve an understanding of this field for a variety of
goals and needs. All readers require some fundamental background in probability
and random processes and preferably their application to communication
problems; one year’s exposure to any of a variety of engineering or mathematics
courses provides this background and the resulting maturity required to start.
Given this preliminary knowledge, there are numerous approaches to utiliza-
tion of this text to achieve various individual goals, as illustrated graphically
by the prerequisite structure of Fig. P-1. A semester or quarter course for the begin-
ning graduate student may involve only Part One, consisting of the first three
chapters (omitting starred sections) which provide, respectively, the fundamental
concepts and parameters of sources and channels, a thorough treatment of channel
models based on physical requirements, and an undiluted initiation into the eval-
uation of code capabilities based on ensemble averages. The advanced student or
[Figure P.1: Part One, Fundamentals of digital communication and block coding (Chaps. 1 to 3); Part Two, Convolutional coding for digital communication (Chaps. 4 to 6); Part Three, Source coding for digital communication (Chaps. 7 and 8); arranged from introductory to advanced.]
Figure P.1 Organization and prerequisite structure.
specialist can then proceed with Part Two, an equally detailed exposition of
convolutional coding and decoding. These techniques are most effective in ex-
ploiting the capabilities of the channel toward approaching virtually error-free
communications. It is possible in a one-year course to cover Part Three as well,
which demonstrates how optimal source coding techniques are derived essentially
as the duals of the channel coding techniques of Parts One and Two.
The applications-oriented engineer or student can obtain an understanding
of channel coding for physical channels by tackling only Chapters 2, 4, and about
half of 6. Avoiding the intricacies of ensemble-average arguments, the reader
can learn how to code for noisy channels without making the additional effort
to understand the complete theory.
At the opposite extreme, students with some background in digital
communications can be guided through the channel-coding material in Chapters
3 through 6 in a one-semester or one-quarter course, and advanced students,
who already have channel-coding background, can cover Part Three on source
coding in a course of similar duration. Numerous problems are provided to
furnish examples, to expand on the material or indicate related results, and
occasionally to guide the reader through the steps of lengthy alternate proofs
and derivations.
Aside from the obvious dependence of any course in this field on Shannon’s
work, two important textbooks have had notable effect on the development and
organization of this book. These are Wozencraft and Jacobs [1965], which first
emphasized the physical characteristics of digital communication channels as a
basis for the development of coding theory fundamentals, and Gallager [1968],
which is the most complete and expert treatment of this field to date.
Collaboration with numerous university colleagues and students helped
establish the framework for this book. But the academic viewpoint has been
tempered in the book by the authors’ extensive involvement with industrial
applications. A particularly strong influence has been the close association of the
first author with the design team at LINKABIT Corporation, led by I. M. Jacobs,
J. A. Heller, A. R. Cohen, and K. S. Gilhousen, which first implemented high-
speed reliable versions of all the convolutional decoding techniques treated in this
book. The final manuscript also reflects the thorough and complete reviews and
critiques of the entire text by J. L. Massey, many of whose suggested improvements
have been incorporated to the considerable benefit of the prospective reader.
Finally, those discouraged by the seemingly lengthy and arduous route to a
thorough understanding of communication theory might well recall the ancient
words attributed to Lao Tzu of twenty-five centuries ago: “The longest journey
starts with but a single step.”
Andrew J. Viterbi
Jim K. Omura
PART
ONE
FUNDAMENTALS OF
DIGITAL COMMUNICATION
AND BLOCK CODING
CHAPTER
ONE
DIGITAL COMMUNICATION SYSTEMS:
FUNDAMENTAL CONCEPTS
AND PARAMETERS
In the field of communication system engineering, the second half of the twentieth
century is destined to be recognized as the era of the evolution of digital communi-
cation, as indeed the first half was the era of the evolution of radio communication
to the point of reliable transmission of messages, speech, and television, mostly in
analog form.
The development of digital communication was given impetus by three prime
driving needs:
1. Greatly increased demands for data transmission of every form, from computer
data banks to remote-entry data terminals for a variety of applications, with
ever-increasing accuracy requirements
2. Rapid evolution of synchronous artificial satellite relays which facilitate world-
wide communications at very high data rates, but whose launch costs, and
consequent power and bandwidth limitations, impose a significant economic
incentive on the efficient use of the channel resources
3. Data communication networks which must simultaneously service many differ-
ent users with a variety of rates and requirements, in which simple and efficient
multiplexing of data and multiple access of channels is a primary economic
concern
These requirements and the solid-state electronic technology needed to sup-
port the development of efficient, flexible, and error-free digital communication
[Figure 1.1: source, source encoder, channel encoder, channel, channel decoder, source decoder, and destination connected in cascade; the source encoder output is the input digital sequence, the source decoder input is the output digital sequence, and the channel encoder, channel, and channel decoder lie within a dashed contour.]
Figure 1.1 Basic model of a digital communication system.
systems evolved simultaneously and in parallel throughout the third quarter of
this century, but the theoretical foundations were laid just before mid-century by
the celebrated “ Mathematical Theory of Communication” papers of C. E. Shan-
non [1948]. With unique intuition, Shannon perceived that the goals of approach-
ing error-free digital communication on noisy channels and of maximally efficient
conversion of analog signals to digital form were dual facets of the same problem,
that they share a common framework and virtually a common solution. For the
most part, this solution is presented in the original Shannon papers. The
refinement, embellishment, and reduction to practical form of the theory occupied
many researchers for the next two decades in efforts which paralleled in time
the technology development required to implement the techniques and algorithms
which the theory dictated.
The dual problem formulated and solved by Shannon is best described in
terms of the block diagram of Fig. 1.1. The source is modeled as a random
generator of data or a stochastic signal to be transmitted. The source encoder
performs a mapping from the source output into a digital sequence (usually
binary). If the source itself generates a digital output, the encoder mapping can be
one-to-one. Ignore for the moment the channel with its encoder and decoder
(within the dashed contour in Fig. 1.1) and replace it by a direct connection called
a noiseless channel. Then if the source encoder mapping is one-to-one, the source
decoder can simply perform the inverse mapping and thus deliver to the destina-
tion the same data as was generated by the source. The purpose of the source
encoder—decoder pair is then to reduce the source output to a minimal representa-
tion. The measure of the “data compression” achieved is the rate in symbols
(usually binary) required per unit time to fully represent and, ultimately at the
source decoder, to reconstitute the source output sequence. This minimum rate at
which the stochastic digital source sequence can be transmitted over a noiseless
channel and reconstructed perfectly is related to a basic parameter of stochastic
sources called entropy.
When the source is analog, it cannot be represented perfectly by a digital
sequence because the source output sequence takes on values from an un-
countably infinite set, and thus obviously cannot be mapped one-to-one into a
discrete set, i.e., a digital alphabet.¹ The best that can be done in mapping the
source into a digital sequence is to tolerate some distortion at the destination after
¹ The simplest example of an analog source encoder is an analog-to-digital converter, also called a
quantizer, for which the source decoder is a digital-to-analog converter.
the source decoder operation which now only approximates the inverse mapping.
In this case, the distortion (appropriately defined) is set at a fixed maximum, and
the goal is to minimize the rate—again defined in digital symbols per unit time—
subject to the distortion limit. The solution to this problem requires the generali-
zation of the entropy parameter of the source to a quantity called the rate
distortion function. This function of distortion represents the minimum rate at
which the source output can be transmitted over a noiseless channel and still be
reconstructed within the given distortion.
The dual to this first problem is the accurate transmission of the digital
(source encoder output) sequence over a noisy channel. Considering now only the
blocks within the dashed contour in Fig. 1.1, the noisy channel is to be regarded as
a random mapping of its input defined over a given discrete set (digital alphabet)
into an output defined over an arbitrary set which is not necessarily the same as
the input set. In fact, for most physical channels the output space is often contin-
uous (uncountable)—although discrete channel models are also commonly
considered.
The goal of the channel encoder and decoder is to map the input digital
sequence into a channel input sequence and conversely the channel output se-
quence into an output digital sequence such that the effect of the channel noise is
minimized—that is, such that the number of discrepancies (errors) between the
output and input digital sequences is minimized. The approach is to introduce
redundancy in the channel encoder and to use this redundancy at the decoder to
reconstitute the input sequence as accurately as possible. Thus in a simplistic sense
the channel coding is dual to the source coding in that the latter eliminates or
reduces redundancy while the former introduces it for the purpose of minimizing
errors. As will be shown to the reader who completes this book, this duality can be
established in a much more quantitative and precise sense. Without further evolu-
tion of the concepts at this point, we can state the single most remarkable conclu-
sion of the Shannon channel coding theory: namely, that with sufficient but finite
redundancy properly introduced at the channel encoder, it is possible for the
channel decoder to reconstitute the input sequence to any degree of accuracy
desired. The measure of redundancy introduced is established by the rate of digital
symbols per unit time input to the channel encoder and output from the channel
decoder. This rate, which is the same as the rate at the source encoder output and
source decoder input, must be less than the rate of transmission over the noisy
channel because of the redundancy introduced. Shannon’s main result here is that
provided the input rate to the channel encoder is less than a given value estab-
lished by the channel capacity (a basic parameter of the channel which is a func-
tion of the random mapping distribution which defines the channel), there exist
encoding and decoding operations which asymptotically for arbitrarily long se-
quences can lead to error-free reconstruction of the input sequence.
As an immediate consequence of the source coding and channel coding
theories, it follows that if the minimum rate at which a digital source sequence can
be uniquely represented by the source encoder is less than the maximum rate for
which the channel output can be reconstructed error-free by the channel decoder
then the system of Fig. 1.1 can transfer digital data with arbitrarily high accuracy
from source to destination. For analog sources the same holds, but only within a
predetermined (tolerable) distortion which determines the source encoder’s mini-
mum rate, provided this rate is less than the channel maximum rate mentioned
above.
This text aims to present quantitatively these fundamental concepts of digital
communication system theory and to demonstrate their applicability to existing
channels and sources.
In this first chapter, two of the basic parameters, source entropy and channel
capacity, are defined and a start is made toward establishing their significance.
Entropy is shown to be the key parameter in the noiseless source coding theorem,
proved in Sec. 1.1. The similar role of the capacity parameter for channels is
partially established by the proof in Sec. 1.3 of the converse to the channel coding
theorem, which establishes that for no rate greater than the maximum determined
by capacity can error-free reconstruction be effected by any channel encoder-
decoder pair. The full significance of capacity is established only in the next two
chapters. Chap. 2 defines and derives the models of the channels of greatest inter-
est to the communication system designer and introduces the rudimentary con-
cepts of channel encoding and decoding. In Chap. 3 the proof of the channel
coding theorem is completed in terms of a particular class of channel codes called
block codes, and thus the full significance of capacity is established.
However, while the theoretical capabilities and limitations of channel coding
are well established by the end of Chap. 3, their practical applicability and manner
of implementation is not yet clear. This situation is for the most part remedied by
Chap. 4 which describes a more practical and powerful class of codes, called
convolutional codes, for which the channel encoding operation is performed by a
digital linear filter, and for which the channel decoding operation arises in a
natural manner from the simple properties of the code. Chap. 5 establishes further
properties and limitations of these codes and compares them with those of block
codes established in Chap. 3. Then Chap. 6 explores an alternative decoding
procedure, called sequential decoding, which permits under some circumstances
and with some limitations the use of extremely powerful convolutional codes.
Finally Chap. 7 returns to the source coding problem, considering analog
sources for the first time and developing the fundamentals of rate distortion
theory for memoryless sources. Both block and convolutional source coding
techniques are treated and thereby the somewhat remarkable duality between
channel and source coding problems and solutions is established. Chap. 8 extends
the concepts of Chap. 7 to sources with memory and presents more advanced
topics in rate distortion theory.
Shannon’s mathematical theory of communication almost from the outset
became known as information theory. While indeed one aspect of the theory is to
define information and establish its significance in practical engineering terms, the
main contribution of the theory has been in establishing the ultimate capabilities
and limitations of digital communication systems. Nevertheless, a natural starting
point is the quantitative definition of information as required by the communica-
tion engineer. This will lead us in Sec. 1.1 to the definition of entropy and the
development of its key role as the basic parameter of digital source coding.
1.1 SOURCES, ENTROPY, AND THE NOISELESS CODING
THEOREM
“ The weather today in Los Angeles is sunny with moderate amounts of smog” is a
news event that, though not surprising, contains some information, since our
previous uncertainty about the weather in Los Angeles is now resolved. On the
other hand, the news event, “Today there was a devastating earthquake in Cali-
fornia which leveled much of downtown Los Angeles,” is more unexpected and
certainly contains more information than the first report. But what is informa-
tion? What is meant by the “information” contained in the above two events?
Certainly if we are formally to define a quantitative measure of information con-
tained in such events, this measure should have some intuitive properties such as:
1. Information contained in events ought to be defined in terms of some measure
of the uncertainty of the events.
2. Less certain events ought to contain more information than more certain
events.
In addition, assuming that weather conditions and earthquakes are unrelated
events, if we were informed of both news events we would expect that the total
amount of information in the two news events be the sum of the information
contained in each. Hence we have a third desired property:
3. The information of unrelated events taken as a single event should equal the
sum of the information of the unrelated events.
A natural measure of the uncertainty of an event α is the probability of α,
denoted P(α). The formal term for “unrelatedness” is independence; two events α
and β are said to be independent if

P(α ∩ β) = P(α)P(β)    (1.1.1)

Once we agree to define the information of an event α in terms of the probability
of α, the properties (2) and (3) will be satisfied if the information in event α is
defined as

I(α) = −log P(α)    (1.1.2)

from which it follows that, if α and β are independent, I(α ∩ β) = −log
P(α)P(β) = −log P(α) − log P(β) = I(α) + I(β). The base of the logarithm merely
specifies the scaling and hence the unit of information we wish to use. This
definition of information appears naturally from the intuitive properties proposed
above, but what good is such a definition of information? Although we would not
expect such a simple definition to be particularly useful in quantifying most of the
complex exchanges of information, we shall demonstrate that this definition is not
only appropriate but also a central concept in digital communication.
Our main concern is the transmission and processing of information in which
the information source and the communication channel are represented by prob-
abilistic models. Sources of information, for example, are defined in terms of
probabilistic models which emit events or random variables. We begin by defining
the simplest type of information source.
Definition A discrete memoryless source (DMS) is characterized by its output,
the random variable u, which takes on letters from a finite alphabet
U = {a_1, a_2, ..., a_A} with probabilities

P(u = a_k) = P(a_k)    k = 1, 2, ..., A    (1.1.3)

Each unit of time, say every T_s seconds, the source emits a random variable
which is independent of all preceding and succeeding source outputs.
According to our definition of information, if at any time the output of our
DMS is u = a_k, which situation we shall label as event α_k, then that output
contains

I(α_k) = −log P(a_k)    (1.1.4)

units of information. If we use natural logarithms, then our units are called “nats”;
and if we use logarithms to the base 2, our units are called “bits.” Clearly, the two
units differ merely by the scale factor ln 2. We shall use “log” to mean logarithm
to the base 2 and “In” to denote natural logarithm. The average amount of
information per source output is simply
H(U) = Σ_{k=1}^{A} P(a_k) I(α_k)
     = Σ_{k=1}^{A} P(a_k) log [1/P(a_k)]    (1.1.5)

H(U) is called the entropy of the DMS. Here we take 0 log 0 = lim_{ε→0} ε log ε = 0.
To establish the operational significance of entropy we require the fundamen-
tal inequality
ln x ≤ x − 1    (1.1.6)

² Throughout this book we shall write a variable below the summation sign to mean summation
over the entire range of the variable (i.e., all possible values which the variable can assume). When the
summation is over only a subset of all the possible values, then the subset will also be shown under the
summation.
[Figure 1.2: sketch of ln x and x − 1 versus x; the line x − 1 lies above the curve ln x, and the two touch at x = 1.]
Figure 1.2 Sketch of the functions ln x and x − 1.

which can be verified by noting that the function f(x) = ln x − (x − 1) has a
unique maximum value of 0 at x = 1. In Fig. 1.2 we sketch ln x and x − 1. For
any two probability distributions P(·) and Q(·) on the alphabet U, it follows from
this inequality that

Σ_u P(u) log [Q(u)/P(u)] = (ln 2)^{−1} Σ_u P(u) ln [Q(u)/P(u)]
                         ≤ (ln 2)^{−1} Σ_u P(u) [Q(u)/P(u) − 1]
                         = (ln 2)^{−1} [Σ_u Q(u) − Σ_u P(u)]
                         = 0    (1.1.7)

which establishes the inequality

Σ_u P(u) log [1/P(u)] ≤ Σ_u P(u) log [1/Q(u)]    (1.1.8)

with equality if and only if Q(u) = P(u) for all u ∈ U.
Inequalities (1.1.6) and (1.1.8) are among the most commonly used inequali-
ties in information theory. Choosing Q(u) = 1/A for all u ∈ {a_1, a_2, ..., a_A} in
(1.1.8), for example, shows that sources with equiprobable output symbols have
the greatest entropy. That is,

0 ≤ H(U) ≤ log A    (1.1.9)

with equality if and only if P(u) = 1/A for all u ∈ U = {a_1, a_2, ..., a_A}.
Example (Binary memoryless source) For a DMS with alphabet U = {0, 1} with probabilities
P(0) = p and P(1) = 1 − p we have entropy

H(U) = p log (1/p) + (1 − p) log [1/(1 − p)] bits
     = ℋ(p)

where ℋ(p) is called the binary entropy function. Here ℋ(p) ≤ 1 with equality if and only if
p = 1/2. When p = 1/2, we call this source a binary symmetric source (BSS). Each output of a BSS
contains 1 bit of information.
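As a quick computational illustration of (1.1.5) and of the binary entropy function, the following Python sketch (with arbitrarily chosen distributions) evaluates H(U) for a finite distribution and checks the bounds of (1.1.9).

import math

def entropy(probs, base=2.0):
    # Entropy of a DMS, eq. (1.1.5); the convention 0 log 0 = 0 is handled by skipping zeros.
    return sum(-p * math.log(p, base) for p in probs if p > 0.0)

def binary_entropy(p):
    # The binary entropy function of the example above.
    return entropy([p, 1.0 - p])

# A four-letter source: entropy is at most log2(4) = 2 bits, eq. (1.1.9).
print(entropy([0.5, 0.25, 0.125, 0.125]))   # 1.75 bits per source symbol
print(entropy([0.25, 0.25, 0.25, 0.25]))    # 2.0 bits, the equiprobable maximum
print(binary_entropy(0.5))                  # 1.0 bit: the BSS case
print(binary_entropy(0.11))                 # roughly 0.5 bit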
Suppose we next let u = (u_1, u_2, ..., u_N) be the DMS output random sequence
of length N. The random variables u_1, u_2, ..., u_N are independent and identically
distributed; hence the probability distribution of u is given by³

P_N(u) = ∏_{n=1}^{N} P(u_n)    (1.1.10)

where P(·) is the given source output distribution. Note that for source output
sequences u = (u_1, u_2, ..., u_N) ∈ U_N of length N, we can define the average amount
of information per source output sequence as

H(U_N) = Σ_u P_N(u) log [1/P_N(u)]    (1.1.11)

As expected, since the source is memoryless, we get

H(U_N) = NH(U)    (1.1.12)

which shows that the total average information in a sequence of independent
outputs is the sum of the average information in each output in the sequence.
³ We adopt the notation that a subscript on a density or distribution function indicates the
dimensionality of the random vector; however, in the case of a one-dimensional random variable, no
subscript is used. Similar subscript notation is used for alphabets to indicate Cartesian products.
If the N outputs are not independent, the equality (1.1.12) becomes only an
upper bound. To obtain this more general result, let
Q_N(u) = ∏_{n=1}^{N} P(u_n)    where    P(u_n) = Σ_{u_j: j ≠ n} P_N(u)    (1.1.13)

is the first-order probability⁴ of output u_n, and Q_N(u) ≠ P_N(u) unless the variables
are independent. Then it follows from (1.1.8) that

H(U_N) = Σ_u P_N(u) log [1/P_N(u)]
       ≤ Σ_u P_N(u) log [1/Q_N(u)]
       = Σ_u P_N(u) log [1/∏_{n=1}^{N} P(u_n)]
       = NH(U)

where the last step follows exactly as in the derivation of (1.1.12). Hence

H(U_N) ≤ NH(U)    (1.1.14)

with equality if and only if the source outputs u_1, u_2, ..., u_N are independent; i.e.,
the random variables u_1, u_2, ..., u_N are the outputs of a memoryless source.
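A small numerical check of (1.1.14), using a hypothetical dependent pair of binary outputs, may make the inequality concrete:

import math

def entropy(probs):
    return sum(-p * math.log2(p) for p in probs if p > 0.0)

# Hypothetical joint distribution P_2(u1, u2) of two dependent binary outputs.
P2 = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}

# First-order (marginal) probabilities, as in (1.1.13); both positions have the same marginal here.
P1 = [sum(p for (u1, _), p in P2.items() if u1 == a) for a in (0, 1)]

print(entropy(P2.values()))   # H(U_2), about 1.72 bits
print(2 * entropy(P1))        # N H(U) = 2.0 bits, so H(U_2) < N H(U) as (1.1.14) requires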
In many applications, the outputs of an information source are either trans-
mitted to some destination or stored in a computer. In either case, it is convenient
to represent the source outputs by binary symbols. It is imperative that this be
done in such a way that the original source outputs can be recovered from the
binary symbols. Naturally, we would like to use as few binary symbols per source
output as possible. Shannon’s first theorem, called the noiseless source coding
theorem, shows that the average number of binary symbols per source output can
be made to approach the entropy of the source and no less. This rather surprising
result gives the notion of entropy of a source its operational significance. We now
prove this theorem for the DMS.
Let u = (u_1, u_2, ..., u_N) be a DMS output random sequence of length N and
x = (x_1, x_2, ..., x_{l_N(u)}) be the corresponding binary sequence of length l_N(u) rep-
resenting the source sequence u. For fixed N, the set of all A^N binary sequences
(codewords) corresponding to all the source sequences of length N is called a code.
Since codeword lengths can be different, in order to be able to recover the original
⁴ We assume here that this distribution is the same for each output and that

H(U) = −Σ_u P(u) log P(u)

For generalizations see Prob. 1.2.
source sequence from the binary symbols we require that no two distinct finite
sequences of codewords form the same overall binary sequence. Such codes are
called uniquely decodable. A sufficient condition for a code to be uniquely decod-
able is the property that no codeword of length l is identical to the first l binary
symbols of another codeword of length greater than or equal to l. That is, no
codeword is a prefix of another codeword. This is clearly a sufficient condition, for
given the binary sequence we can always uniquely determine the end of a code-
word and no two codewords are the same. Uniquely decodable codes with this
prefix property have the practical advantage of being “instantaneously decod-
able”; that is, each codeword can be decoded as soon as the last symbol of the
codeword is received.
Example Suppose U = {a, b, c}. Consider the following codes for sequences of length N = 1.

u    Code 1    Code 2    Code 3
a    0         00        1
b    1         01        10
c    01        10        100
Code 1 is not uniquely decodable since the binary sequence 0101 can be due to source sequences
abab, abc, cc, or cab. Code 2 is uniquely decodable since all codewords are the same length and
distinct. Code 3 is also uniquely decodable since “ 1” always marks the beginning of a codeword
and codewords are distinct. For N = 2 suppose we have a code
u    Code 4
aa 000
ab 001
ac 010
ba 011
bb 1000
bc 1001
ca 1010
cb 1011
cc 1100
This code for source sequences of length 2 in U_2 is uniquely decodable since all sequences are
unique and the first symbol tells us the codeword length. A first “0” tells us the codeword is of
length 3 while a first “1” will tell us the codeword is of length 4. Furthermore this code has the
property that no codeword is a prefix of another. That is, all codewords are distinct and no
codeword of length 3 can be the first 3 binary symbols of a codeword of length 4.
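Both the prefix condition and a brute-force search for two message sequences that encode into the same binary string are easy to mechanize. The following Python sketch applies both tests to Codes 1, 2, and 3 above; note that Code 3 fails the prefix test yet shows no collisions, since the prefix condition is sufficient but not necessary for unique decodability.

from itertools import product

def is_prefix_free(codewords):
    # Sufficient condition for unique decodability: no codeword is a prefix of another.
    return not any(a != b and b.startswith(a) for a in codewords for b in codewords)

def has_collision(codewords, max_len=4):
    # Brute force over all message sequences of up to max_len source letters.
    seen = {}
    for n in range(1, max_len + 1):
        for msg in product(range(len(codewords)), repeat=n):
            s = "".join(codewords[i] for i in msg)
            if s in seen and seen[s] != msg:
                return True      # two distinct messages map to the same binary string
            seen[s] = msg
    return False

codes = {"Code 1": ["0", "1", "01"], "Code 2": ["00", "01", "10"], "Code 3": ["1", "10", "100"]}
for name, c in codes.items():
    print(name, "prefix-free:", is_prefix_free(c), "collision found:", has_collision(c))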
We now proceed to state and prove the noiseless source coding theorem in its
simplest form. This theorem will serve to establish the operational significance of
entropy.
Theorem 1.1.1: Noiseless coding theorem for discrete memoryless sources
Given a DMS with alphabet U and entropy H(U), for source sequences of
length N (N = 1, 2, ...) there exists a uniquely decodable binary code consist-
ing of binary sequences (codewords) of lengths l_N(u) for u ∈ U_N, such that the
average length of the codewords is bounded by

⟨L_N⟩ = Σ_u P_N(u) l_N(u) < N[H(U) + o(N)]    (1.1.15)

where o(N) is a term which becomes vanishingly small as N approaches
infinity. Conversely, no such code exists for which

⟨L_N⟩ < NH(U)
The direct half of the theorem, as expressed by (1.1.15) is proved by construct-
ing a uniquely decodable code which achieves the average length bound. There are
several such techniques, the earliest being that of Shannon [1948] (see Prob. 1.6),
and the one producing an optimal code, i.e., the one which minimizes the average
length for any value of N, being that of Huffman [1952]. We present here yet
another technique which, while less efficient than these standard techniques, not
only proves the theorem very directly, but also serves to illustrate an interesting
property of the DMS, shared by a much wider class of sources, called the asymp-
totic equipartition property (AEP). We develop this by means of the following:
Lemma 1.1.1 For any ε > 0, consider a DMS with alphabet U, entropy
H = H(U), and the subset of all source sequences of length N defined by

S(N, ε) = {u: 2^{−N[H+ε]} ≤ P_N(u) ≤ 2^{−N[H−ε]}}    (1.1.16)

Then all the source sequences in S(N, ε) can be uniquely represented by
binary codewords of fixed length L_N, where

N[H(U) + ε] < L_N ≤ N[H(U) + ε] + 1    (1.1.17)

Furthermore

Pr {u ∉ S(N, ε)} ≤ σ²/(Nε²)    (1.1.18)

where

σ² = Σ_u [−log P(u) − H(U)]² P(u)

Note that all source sequences in the set S(N, ε) are nearly equiprobable,
deviating from the value 2^{−NH(U)} by a factor no greater than 2^{Nε}.
Proof Since S(N, ε) is a subset of U_N, the set of sequences of length N, we
have the inequality

1 = Σ_{u ∈ U_N} P_N(u) ≥ Σ_{u ∈ S(N, ε)} P_N(u)    (1.1.19)

Since by definition P_N(u) ≥ 2^{−N[H+ε]} for every u ∈ S(N, ε), this becomes

1 ≥ Σ_{u ∈ S(N, ε)} 2^{−N[H+ε]} = 2^{−N[H+ε]} |S(N, ε)|    (1.1.20)

where |S(N, ε)| is the number of distinct sequences in S(N, ε). This gives us
the bound

|S(N, ε)| ≤ 2^{N[H+ε]}    (1.1.21)

This bound, of course, is generally not an integer, let alone a power of 2.
However, we may bracket it between powers of 2 by choosing the integer L_N
such that

2^{L_N − 1} ≤ 2^{N[H+ε]} < 2^{L_N}    (1.1.22)

Since there are 2^{L_N} distinct binary sequences of length L_N, we can represent
uniquely all source sequences belonging to S(N, ε) with binary sequences of
length L_N, which satisfies (1.1.17).
Turning now to the probability of S̄(N, ε), the complementary set
of S(N, ε), which consists of all sequences not represented in this manner, let

F_N = Pr {u ∈ S̄(N, ε)} = Σ_{u ∈ S̄(N, ε)} P_N(u)    (1.1.23)

From the definition (1.1.16) of S(N, ε), we have

S(N, ε) = {u: −N[H + ε] ≤ log P_N(u) ≤ −N[H − ε]}
        = {u: −NH − Nε ≤ log ∏_{n=1}^{N} P(u_n) ≤ −NH + Nε}
        = {u: −Nε ≤ Σ_{n=1}^{N} log P(u_n) + NH ≤ Nε}
        = {u: |−(1/N) Σ_{n=1}^{N} log P(u_n) − H| ≤ ε}    (1.1.24)

Hence the complementary set is

S̄(N, ε) = {u: |−(1/N) Σ_{n=1}^{N} log P(u_n) − H| > ε}    (1.1.25)
The random variables

z_n = −log P(u_n)    n = 1, 2, ..., N    (1.1.26)

are independent identically distributed random variables with expected value

z̄ = E[z] = −Σ_k P(a_k) log P(a_k) = H(U)    (1.1.27)

and a finite variance which we denote as

σ² = var [z]

From the well-known Chebyshev inequality (see Prob. 1.4) it follows that for
the sum of N such random variables

Pr {|(1/N) Σ_{n=1}^{N} z_n − z̄| > ε} ≤ σ²/(Nε²)    (1.1.28)

Hence for F_N we have

F_N = Σ_{u ∈ S̄(N, ε)} P_N(u)
    = Pr {|−(1/N) Σ_{n=1}^{N} log P(u_n) − H| > ε}
    ≤ σ²/(Nε²)    (1.1.29)

Thus F_N, the probability of occurrence of any source sequence not encoded by a
binary sequence of length L_N, becomes vanishingly small as N approaches infinity.
Indeed, using the tighter Chernoff bound (see Prob. 1.5) we can show that F_N
decreases exponentially with N. The property that source output sequences
belong to S(N, ε) with probability approaching 1 as N increases to infinity is
called the asymptotic equipartition property.
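The concentration expressed by (1.1.18) is easy to observe by simulation. The following Python sketch, with an arbitrarily chosen source distribution, draws sequences from a DMS and estimates the probability of falling outside S(N, ε) as N increases.

import math
import random

P = {"a": 0.6, "b": 0.3, "c": 0.1}                      # hypothetical DMS
H = sum(-p * math.log2(p) for p in P.values())          # entropy H(U), about 1.30 bits

def prob_outside_typical_set(N, eps, trials=20000, seed=1):
    # Monte-Carlo estimate of Pr{u not in S(N, eps)}, bounded in (1.1.18) by sigma^2/(N eps^2).
    rng = random.Random(seed)
    letters, probs = zip(*P.items())
    misses = 0
    for _ in range(trials):
        u = rng.choices(letters, probs, k=N)
        info = -sum(math.log2(P[x]) for x in u) / N     # -(1/N) log P_N(u)
        if abs(info - H) > eps:
            misses += 1
    return misses / trials

for N in (10, 100, 1000):
    print(N, prob_outside_typical_set(N, eps=0.1))      # shrinks toward 0 as N grows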
PROOF OF THEOREM 1.1.1 Using the results of Lemma 1.1.1, suppose we add one
more binary symbol to each of the binary representatives of the sequences in
S(N, ε) by preceding these binary representatives with a “0.” While this in-
creases the binary sequence lengths from L_N to L_N + 1, it has a vanishingly
small influence for asymptotically large N. Then using (1.1.17) we have that all
sequences in S(N, ε) are represented uniquely with binary sequences of length
1 + L_N ≤ N[H + ε] + 2 bits. For all other sequences in S̄(N, ε), suppose these
are represented by a sequence of length 1 + L′_N where the first binary symbol
is always “1” and the remaining L′_N symbols uniquely represent each se-
quence in S̄(N, ε). This is certainly possible if L′_N satisfies

2^{L′_N − 1} ≤ A^N < 2^{L′_N}

or

N log A ≤ L′_N < N log A + 1    (1.1.30)

since this is enough to represent uniquely all sequences in U_N.
We now have a unique binary representation or codeword for each
output sequence of length N. This code is the same type as Code 4 in
the example. It is uniquely decodable since the first bit specifies the length
(“0” means length 1 + L_N and “1” means length 1 + L′_N) of the codeword
and the remaining bits uniquely specify the source sequence of length N. If the
first bit is a “0” we examine the next L_N bits which establish uniquely a source
sequence in S(N, ε) while if the first bit is a “1” we examine the next L′_N bits
which establish uniquely a source sequence in S̄(N, ε). Each codeword is a
unique binary sequence and there is never any uncertainty as to when a
codeword sequence begins and ends. No codeword is a prefix of another. The
encoder just described is illustrated in Fig. 1.3.
We have thus developed a uniquely decodable code with two possible
codeword lengths, L_N and L′_N. The average length of codewords is thus

⟨L_N⟩ = (1 + L_N) Pr {u ∈ S(N, ε)} + (1 + L′_N) Pr {u ∈ S̄(N, ε)}
      ≤ 1 + L_N + L′_N F_N    (1.1.31)

and it follows from (1.1.17), (1.1.18), and (1.1.30) that

⟨L_N⟩ ≤ 1 + N[H(U) + ε] + 1 + [N log A + 1] σ²/(Nε²)

or

⟨L_N⟩/N ≤ H(U) + ε + 2/N + (log A + 1/N) σ²/(Nε²)    (1.1.32)

[Figure 1.3: the DMS output u = (u_1, u_2, ..., u_N) enters the source encoder, which
emits 0, x_1, x_2, ..., x_{L_N} if u ∈ S(N, ε) and 1, x_1, x_2, ..., x_{L′_N} if u ∈ S̄(N, ε),
where L_N ≤ N[H(U) + ε] + 1 and L′_N ≤ N log A + 1.]
Figure 1.3 Noiseless source encoder.
Choosing ε = N^{−1/5}, this yields

⟨L_N⟩/N ≤ H(U) + 2N^{−1} + [(log A + N^{−1})σ² + 1] N^{−1/5}
        = H(U) + o(N)    (1.1.33)

which establishes the direct half of the theorem.
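The construction just completed can also be written out explicitly for a toy source. The following Python sketch, with an arbitrarily chosen distribution, builds the two-length codebook of Fig. 1.3; it enumerates all A^N sequences and so is practical only for very small N.

import math
from itertools import product

def two_length_code(P, N, eps):
    # Codebook for the two-length encoder of Fig. 1.3 (exhaustive enumeration).
    letters = sorted(P)
    H = sum(-p * math.log2(p) for p in P.values())
    L_typ = math.floor(N * (H + eps)) + 1                      # from (1.1.17)
    L_all = math.floor(N * math.log2(len(letters))) + 1        # from (1.1.30)
    code, prob_typical, i_typ, i_all = {}, 0.0, 0, 0
    for u in product(letters, repeat=N):
        p_u = math.prod(P[x] for x in u)
        if abs(-math.log2(p_u) / N - H) <= eps:                # u is in S(N, eps)
            code[u] = "0" + format(i_typ, "0{}b".format(L_typ))
            i_typ += 1
            prob_typical += p_u
        else:                                                  # u is in the complement
            code[u] = "1" + format(i_all, "0{}b".format(L_all))
            i_all += 1
    return code, H, prob_typical

P = {"a": 0.7, "b": 0.2, "c": 0.1}
code, H, p_typ = two_length_code(P, N=10, eps=0.3)
avg = sum(math.prod(P[x] for x in u) * len(w) for u, w in code.items()) / 10
# For such a small N the code is crude; <L_N>/N approaches H(U) only as N grows and eps shrinks.
print(H, p_typ, avg)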
Before proceeding with the converse half of the theorem, we note that by
virtue of the asymptotic equipartition property, for large N nearly all code-
words can be made of equal length, slightly larger than NH(U), and only two
lengths of codewords are required.⁵ For small N, a large variety of codeword
lengths becomes more desirable. In fact, just as we have chosen here the length
Ly to be approximately equal to the negative logarithm (base 2) of the almost
common probability of the output sequence of length N where N is large, so it
is desirable (and nearly optimal) to make the codeword lengths proportional
to the logarithms of the source sequence probabilities when N is small. In the
latter case, individual source sequence probabilities are not generally close
together and hence many codeword lengths are required to achieve small
average length. The techniques for choosing these so as to produce a uniquely
decodable code are several (Shannon [1948], Huffman [1952]) and they have
been amply described in many texts. The techniques are not prerequisites to
any of the material presented in this book and thus they will not be included
in this introductory chapter on fundamental parameters (see, however,
Prob. 1.6).
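For the reader who wishes to see one such variable-length construction in executable form, the following is a compact Python sketch of Huffman's procedure, given here purely as an illustration.

import heapq
import math

def huffman_code(P):
    # Binary Huffman code for a single-letter distribution P: returns {letter: codeword}.
    heap = [(p, i, [a]) for i, (a, p) in enumerate(sorted(P.items()))]
    heapq.heapify(heap)
    code = {a: "" for a in P}
    counter = len(heap)
    while len(heap) > 1:
        p0, _, group0 = heapq.heappop(heap)          # the two least probable groups
        p1, _, group1 = heapq.heappop(heap)
        for a in group0:
            code[a] = "0" + code[a]
        for a in group1:
            code[a] = "1" + code[a]
        heapq.heappush(heap, (p0 + p1, counter, group0 + group1))
        counter += 1
    return code

P = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
code = huffman_code(P)
H = sum(-p * math.log2(p) for p in P.values())
avg = sum(P[a] * len(w) for a, w in code.items())
print(code, H, avg)    # with these dyadic probabilities the average length equals H(U) exactly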
To prove the converse, we must keep in mind that in general we may have
a large variety of codeword lengths. Thus for source sequence u ∈ U_N we have
a codeword x(u) which represents u and has length denoted l_N(u). The lengths
of the codewords may be arbitrary. However, the resulting code must be
uniquely decodable. For an arbitrary uniquely decodable code we establish a
lower bound on ⟨L_N⟩.
Consider the identity

(Σ_u 2^{−l_N(u)})^M = (Σ_{u_1} 2^{−l_N(u_1)})(Σ_{u_2} 2^{−l_N(u_2)}) ··· (Σ_{u_M} 2^{−l_N(u_M)})
                    = Σ_{u_1} Σ_{u_2} ··· Σ_{u_M} 2^{−[l_N(u_1) + l_N(u_2) + ··· + l_N(u_M)]}    (1.1.34)
where each sum on both sides of the equation is over the entire space U_N. If
we let A_k be the number of sequences of M successive codewords having a
⁵ If errors occurring with probability F_N could be tolerated, all codewords could be made of equal
length. While this may seem unacceptable, we shall find in the next chapter that in transmission over a
noisy channel some errors are inevitable; hence if we can make F_N smaller than the probability of
transmission errors, this may be a reasonable approach.
total length of k binary symbols, (1.1.34) can be expressed as

(Σ_u 2^{−l_N(u)})^M = Σ_{k=1}^{Ml*} A_k 2^{−k}    (1.1.35)

where l* = max_u l_N(u). But in order for the source sequences to be recoverable
from the binary sequences we must have

A_k ≤ 2^k    k = 1, 2, ..., Ml*    (1.1.36)

Otherwise two or more sequences of M successive codewords will give the
same binary sequence, violating our uniqueness requirement. Using this
bound for A_k, we have

(Σ_u 2^{−l_N(u)})^M ≤ Σ_{k=1}^{Ml*} 1 = Ml*    (1.1.37)

for all integers M. Clearly this can be satisfied for all M if and only if

Σ_u 2^{−l_N(u)} ≤ 1    (1.1.38)

for the left side of (1.1.37) behaves exponentially in M while the right side
grows only linearly with M. This inequality is known as the Kraft-McMillan
inequality (Kraft [1949], McMillan [1956]).
If we were now to use the general variable length source encoder whose
code lengths must satisfy (1.1.38) we would have an average of

⟨L_N⟩ = Σ_u P_N(u) l_N(u)    (1.1.39)

binary symbols per source sequence. Defining on u ∈ U_N the distribution

Q_N(u) = 2^{−l_N(u)} / Σ_{u′} 2^{−l_N(u′)}    (1.1.40)

we have from inequality (1.1.8) and (1.1.12)

NH(U) = H(U_N) = Σ_u P_N(u) log [1/P_N(u)]
      ≤ Σ_u P_N(u) log [1/Q_N(u)]
      = Σ_u P_N(u) log (2^{l_N(u)}) + Σ_u P_N(u) log (Σ_{u′} 2^{−l_N(u′)})
      = ⟨L_N⟩ + log (Σ_{u′} 2^{−l_N(u′)})    (1.1.41)
Since the Kraft-McMillan inequality (1.1.38) guarantees that the second term
is not positive we have
NH(U) ≤ ⟨L_N⟩    (1.1.42)

This bound applies for any sequence length N and it follows that any source
code for which the source sequence can be recovered from the binary se-
quence (uniquely decodable) requires at least an average of H(U) bits per
source symbol.
This completes the proof of Theorem 1.1.1 and we have thus shown that it is
possible to source encode a DMS with an average number of binary symbols per
source symbol arbitrarily close to its entropy and that it is impossible to have a
lower average. This is a special case of the noiseless source coding theorem of
information theory which applies for arbitrary discrete alphabet stationary ergod-
ic sources and arbitrary finite code alphabets (see Prob. 1.3) and gives the notion
of entropy its operational significance. If we were to relax the requirement that the
source sequence be recoverable from the binary-code sequence and replaced it by
some average distortion requirement, then of course, we could use fewer than
H(U) bits per source symbol. This generalization to source encoding with a distor-
tion measure is called rate distortion theory. This theory, which was first pre-
sented by Shannon in 1948 and developed further by him in 1959, is the subject of
Chap. 7 and Chap. 8.
Another important consequence of the theorem is the asymptotic equality of
the probability of source sequences as N becomes large. If we treat these sequences
of length N as messages to be transmitted, even without considering their efficient
binary representation, we have shown that the “typical” messages are asymptot-
ically equiprobable, a useful property in subsequent chapters where we treat
means of accurately transmitting messages over noisy channels.
1.2 MUTUAL INFORMATION AND CHANNEL CAPACITY
Shannon demonstrated how information can be reliably transmitted over a noisy
communication channel by considering first a measure of the amount of informa-
tion about the transmitted message contained in the observed output of the chan-
nel. To do this he defined the notion of mutual information between events α and
β, denoted I(α; β), which is the information provided about the event α by the
occurrence of the event β. As before the probabilities P(α), P(β), and P(α ∩ β) are
assumed as given parameters of the model. Clearly to be consistent with our
previous definition of information we must have two boundary condition
properties:
1. If α and β are independent events (P(α ∩ β) = P(α)P(β)), then the occurrence
of β would provide no information about α. That is, I(α; β) = 0.
2. If the occurrence of β indicates that α has definitely occurred (P(α|β) = 1), then
the occurrence of β provides us with all the information regarding α. That is,
I(α; β) = I(α) = log [1/P(α)].

These two boundary condition properties are satisfied if the mutual information
between events α and β is defined as

I(α; β) = log [P(α|β)/P(α)] = log [P(α ∩ β)/(P(α)P(β))]    (1.2.1)

Note that this definition is symmetric in the two events since I(α; β) = I(β; α).
Also mutual information is a generalization of the earlier definition of the infor-
mation of an event α since I(α) = log [1/P(α)] = I(α; α). Hence I(α) is sometimes
referred to as the self-information of the event α. Note that although I(α) is always
nonnegative, mutual information I(α; β) can assume negative values. For example,
if P(α|β) < P(α) then I(α; β) < 0 and we see that observing β makes α seem less
likely than it was a priori before the observation.
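A two-event numerical example, with probabilities chosen purely for illustration, makes the sign behavior of I(α; β) concrete:

import math

def event_mutual_information(p_joint, p_alpha, p_beta):
    # I(alpha; beta) = log2 [ P(alpha and beta) / (P(alpha) P(beta)) ], eq. (1.2.1).
    return math.log2(p_joint / (p_alpha * p_beta))

# Observing beta makes alpha more likely: positive mutual information.
print(event_mutual_information(0.08, 0.1, 0.2))    # log2 4 = 2 bits
# Observing beta makes alpha less likely: negative mutual information.
print(event_mutual_information(0.01, 0.1, 0.2))    # log2 0.5 = -1 bit
# Independent events: zero mutual information.
print(event_mutual_information(0.02, 0.1, 0.2))    # 0 bits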
We are primarily interested in the mutual information between inputs and
outputs of a communication channel. Virtually all the channels treated through-
out this book will be reduced to discrete-time channels which may be regarded as
a random mapping of the random variable x_n, the channel input, to the variable
y_n, the channel output, at integer-valued time n. Generally these random variables
will be either discrete random variables or absolutely continuous random var-
iables. While only the former usually apply to practical systems, the latter also
merit consideration in that they represent the limiting case of the discrete model.
We start with discrete channels where the input and output random variables are
discrete random variables. Generalizations to continuous random variables or a
combination of a discrete input random variable and a continuous output random
variable is usually trivial and requires simply changing probability distributions
to probability densities and summations to integrals. In Chap. 2 we shall see how
these various channels appear in practice when we have additive white Gaussian
noise disturbance in the channel. Here we begin by formally defining discrete
memoryless channels.
Definition A discrete memoryless channel (DMC) is characterized by a discrete
input alphabet X, a discrete output alphabet Y, and a set of conditional
probabilities for outputs given each of the inputs. We denote the given condi-
tional probabilities⁶ by p(y|x) for y ∈ Y and x ∈ X. Each output letter of the
channel depends only on the corresponding input so that for an input se-
⁶ Throughout the book we use lowercase letters for both probability distributions and probability
densities associated with channel input and output random variables.
[Figure 1.4: x ∈ {0, 1} and y ∈ {0, 1}; each input is received as itself with probability 1 − p and as its complement with probability p.]
Figure 1.4 Binary symmetric channel.
quence of length N, denoted x = (x_1, x_2, ..., x_N), the conditional probability
of a corresponding output sequence, denoted y = (y_1, y_2, ..., y_N), may be
expressed as⁷

p_N(y|x) = ∏_{n=1}^{N} p(y_n|x_n)    (1.2.2)
This is the memoryless condition of the definition. We define next the most
common type of DMC.
Definition A binary symmetric channel (BSC) is a DMC with X = Y = {0, 1}
and conditional probabilities of the form

p(1|0) = p(0|1) = p
p(0|0) = p(1|1) = 1 − p    (1.2.3)
This is represented by the diagram of Fig. 1.4.
We can easily generalize our definition of DMC to channels with alphabets
that are not discrete. A common example is the additive Gaussian noise channel
which we define next.
Definition The memoryless discrete-input additive Gaussian noise channel is a
memoryless channel with discrete input alphabet X = {a_1, a_2, ..., a_Q}, output
alphabet Y = (−∞, ∞) and conditional probability density

p(y|a_k) = (1/√(2πσ²)) e^{−(y − a_k)²/2σ²}    for all y ∈ Y    (1.2.4)

where k = 1, 2, ..., Q.
This is represented by the diagram of Fig. 1.5 where n is a Gaussian random
variable with zero mean and variance σ². For this case, memoryless again means

⁷ This definition is appropriate when and only when feedback is excluded; that is, when the
transmitter has no knowledge of what was received. In general, we would require
p(y_n | x_1, ..., x_n, y_1, ..., y_{n−1}) = p(y_n | x_n) for all n.
[Figure 1.5: the channel adds a zero-mean Gaussian noise variable n to the input a_k, producing the output y = a_k + n.]
Figure 1.5 Additive Gaussian noise channel.
that for any input sequence x of length N and any corresponding output sequence
y we have
p_N(y|x) = ∏_{n=1}^{N} p(y_n|x_n)    (1.2.2)
for all N. These and other channels will be discussed further in Chap. 2. In this
chapter we examine only discrete memoryless channels.
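The discrete-input additive Gaussian noise channel of (1.2.4) is also simple to simulate; the following Python sketch, with arbitrary input levels and noise variance, passes a short input sequence through such a channel.

import random

def awgn_channel(x_sequence, sigma, seed=None):
    # Memoryless channel of (1.2.4): y_n = x_n + n_n with n_n Gaussian, zero mean, variance sigma^2.
    rng = random.Random(seed)
    return [x + rng.gauss(0.0, sigma) for x in x_sequence]

# Two-level input alphabet {-1, +1} (Q = 2) and noise standard deviation 0.5.
x = [random.choice((-1.0, 1.0)) for _ in range(10)]
y = awgn_channel(x, sigma=0.5, seed=7)
print(list(zip(x, y)))    # each output equals its input plus an independent Gaussian sample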
Consider a DMC with input alphabet X, output alphabet Y and conditional
probabilities p(y|x) for y ∈ Y, x ∈ X. Suppose, in addition, that input letters
occur with probability q(x) for x ∈ X. We can then regard the input to the channel
as a random variable and the output as a random variable. If we observe the
output y then the amount of information this provides about the input x is the
mutual information

I(x; y) = log [p(y|x)/p(y)] = log [q(x|y)/q(x)]    (1.2.5)

where    p(y) = Σ_x p(y|x)q(x)    (1.2.6)

As with sources, we are primarily interested in the average amount of information
that the output of the channel provides about the input. Thus we define the
average mutual information between inputs and outputs of the DMC as⁸

I(X; Y) = E[I(x; y)] = Σ_x Σ_y p(y|x)q(x) log [p(y|x)/p(y)]    (1.2.7)
The average mutual information I(2; Y) is defined in terms of the given channel
conditional probabilities and the input probability which is independent of the
® Actually the definition is not restricted to channel inputs and outputs. It is the appropriate
definition for the average mutual information between an arbitrary pair of random variables. For
absolutely continuous random variables we replace summations and probabilities by integrals and
density functions.
DMC. We can then maximize I(𝒳; 𝒴) with respect to the input probability
distribution q = {q(x): x ∈ 𝒳}.

Definition The channel capacity of a DMC is the maximum average mutual
information, where the maximization is over all possible input probability
distributions. That is,

    C = max_q I(𝒳; 𝒴)   (1.2.8)
Example (BSC) By symmetry the capacity for the BSC with crossover probability p, as shown in
Fig. 1.4, is achieved with channel input probability q(0) = q(1) = 1/2. Hence

    C = I(𝒳; 𝒴)|_{q(0)=q(1)=1/2}
      = 1 − [p log (1/p) + (1 − p) log (1/(1 − p))]
      = 1 − ℋ(p) bits/symbol

As expected, when p = 1/2 we have ℋ(1/2) = 1 and C = 0. With p = 0 we get ℋ(0) = 0 and C = 1 bit,
which is exactly the information in each channel input. Note that we also have C = 1 when p = 1,
since from the output symbol, which is the complement of the input binary symbol, we can
uniquely determine the input symbol. By extending this argument, it follows that
C(p) = C(1 − p).
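As a concrete check of this example (an illustration of ours, not part of the original text), the average mutual information (1.2.7) can be evaluated numerically for the BSC; the function names below are ours and numpy is assumed available. With the symmetric input q(0) = q(1) = 1/2 the result agrees with 1 − ℋ(p).

    import numpy as np

    def average_mutual_information(q, P):
        # I(X;Y) in bits, per (1.2.7): sum over x, y of q(x) p(y|x) log2[p(y|x)/p(y)].
        q, P = np.asarray(q, float), np.asarray(P, float)
        p_out = q @ P                                # p(y) = sum_x q(x) p(y|x), as in (1.2.6)
        return sum(q[x] * P[x, y] * np.log2(P[x, y] / p_out[y])
                   for x in range(len(q)) for y in range(P.shape[1])
                   if q[x] > 0 and P[x, y] > 0)

    p = 0.11                                         # crossover probability
    bsc = [[1 - p, p], [p, 1 - p]]
    print(average_mutual_information([0.5, 0.5], bsc))       # I with the symmetric input
    print(1 + p * np.log2(p) + (1 - p) * np.log2(1 - p))     # 1 - H(p): the same value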
Note that channel capacity is defined only in terms of given channel charac-
teristics. Even though these are assumed given, performing the maximization to
find channel capacity is generally difficult. Maximization or minimization of func-
tions over probability distributions can often be evaluated with the aid of the
Kuhn-Tucker theorem (see App. 3B). In Chap. 3 we shall find necessary and
sufficient conditions on the input probability assignment that achieves capacity as
well as for the maximization of other functions that arise in the analysis of digital
communication systems. (In App. 3C we also give a simple computational algor-
ithm for evaluating capacity.) We shall see that, like the entropy parameter for a
source, the capacity for a channel has operational significance, related directly to
limitations on the reliable transmission of information through the channel. First,
however, we examine some properties of average mutual information, which will
be useful later.
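The computational algorithm mentioned above appears in App. 3C and is not reproduced here; as a hedged illustration of how such a maximization can be carried out numerically, the following sketch (ours) implements the well-known Blahut-Arimoto alternating maximization. The function name and iteration count are our own choices, and numpy is assumed available.

    import numpy as np

    def capacity_estimate(P, iterations=300):
        # Alternating-maximization sketch: q(x) is reweighted at each step by
        # 2**D(x), where D(x) is the divergence between p(.|x) and the current p(.).
        P = np.asarray(P, float)
        K = P.shape[0]
        q = np.full(K, 1.0 / K)
        for _ in range(iterations):
            p_out = q @ P
            D = np.zeros(K)
            for x in range(K):
                mask = P[x] > 0
                D[x] = np.sum(P[x, mask] * np.log2(P[x, mask] / p_out[mask]))
            w = q * np.exp2(D)
            q = w / w.sum()
        p_out = q @ P
        return sum(q[x] * P[x, y] * np.log2(P[x, y] / p_out[y])
                   for x in range(K) for y in range(P.shape[1])
                   if q[x] > 0 and P[x, y] > 0)

    # Binary erasure channel with erasure probability 0.2: the estimate approaches 0.8 bit.
    eps = 0.2
    bec = [[1 - eps, eps, 0.0], [0.0, eps, 1 - eps]]
    print(capacity_estimate(bec))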
Lemma 1.2.1

    0 ≤ I(𝒳; 𝒴) ≤ Σ_x Σ_y p(y|x) q(x) log [p(y|x)/p̃(y)]   (1.2.9)

where p̃(·) is any probability distribution on the output alphabet. Equality is achieved in the upper
bound if and only if p̃(y) = p(y) = Σ_x q(x) p(y|x) for all y ∈ 𝒴. I(𝒳; 𝒴) = 0 if
and only if the output random variable is independent of the input random
variable.
PROOF⁹ The lower bound is found by using the inequality ln x ≤ x − 1 as
follows:

    −I(𝒳; 𝒴) = Σ_x Σ_y p(y|x) q(x) log [p(y)/p(y|x)]
             = (ln 2)⁻¹ Σ_x Σ_y p(y|x) q(x) ln [p(y)/p(y|x)]
             ≤ (ln 2)⁻¹ Σ_x Σ_y p(y|x) q(x) [p(y)/p(y|x) − 1]
             = (ln 2)⁻¹ [Σ_x Σ_y p(y) q(x) − Σ_x Σ_y p(y|x) q(x)]
             = 0   (1.2.10)

with equality to zero if and only if p(y|x) = p(y) for all y and x.
The upper bound to I(𝒳; 𝒴) follows from the form

    I(𝒳; 𝒴) = Σ_y p(y) log [1/p(y)] − Σ_x Σ_y p(y|x) q(x) log [1/p(y|x)]   (1.2.11)

It follows from (1.1.8) that

    Σ_y p(y) log [1/p(y)] ≤ Σ_y p(y) log [1/p̃(y)]

with equality if and only if p̃(y) = p(y) for all y ∈ 𝒴. Substituting this inequality
for the first term of (1.2.11) yields the desired result.
Consider a sequence of input random variables of length N denoted
x = (x₁, x₂, ..., x_N). Let the probability of the input sequence be given by q_N(x)
for x ∈ 𝒳_N, and let the resulting marginal probability of x_n be q⁽ⁿ⁾(x) for x ∈ 𝒳,
where n = 1, 2, ..., N. That is

    q⁽ⁿ⁾(x) = Σ_{x: x_n = x} q_N(x)

for each n. The average mutual information between input sequences of length N
and the corresponding output sequences of length N is

    I(𝒳_N; 𝒴_N) = Σ_x Σ_y p_N(y|x) q_N(x) log [p_N(y|x)/p_N(y)]   (1.2.12)

where

    p_N(y) = Σ_x p_N(y|x) q_N(x)
⁹ Although the properties given here hold for any logarithm base, we shall prove properties for base
2. Generalization to any base is trivial.
Since the channel is memoryless, the average mutual information between x_n and
the corresponding output y_n is

    I(𝒳⁽ⁿ⁾; 𝒴⁽ⁿ⁾) = Σ_y Σ_x p(y|x) q⁽ⁿ⁾(x) log [p(y|x)/p⁽ⁿ⁾(y)]   (1.2.13)

where p⁽ⁿ⁾(y) = Σ_x p(y|x) q⁽ⁿ⁾(x) and n = 1, 2, ..., N. We then have
Lemma 1.2.2

    I(𝒳_N; 𝒴_N) ≤ Σ_{n=1}^N I(𝒳⁽ⁿ⁾; 𝒴⁽ⁿ⁾) ≤ NC   (1.2.14)

where equality is achieved in the lower inequality when (but not only when)
x₁, x₂, ..., x_N are independent random variables and in the upper inequality
when and only when each independent input random variable has the proba-
bility distribution that achieves channel capacity.
PROOF From Lemma 1.2.1 we have

    I(𝒳_N; 𝒴_N) ≤ Σ_x Σ_y p_N(y|x) q_N(x) log [p_N(y|x)/p̃_N(y)]   (1.2.15)

for any probability distribution p̃_N(·). Now choose

    p̃_N(y) = ∏_{n=1}^N p⁽ⁿ⁾(y_n)   (1.2.16)

Then since

    p_N(y|x)/p̃_N(y) = ∏_{n=1}^N [p(y_n|x_n)/p⁽ⁿ⁾(y_n)]   (1.2.17)

we have

    I(𝒳_N; 𝒴_N) ≤ Σ_x Σ_y p_N(y|x) q_N(x) Σ_{n=1}^N log [p(y_n|x_n)/p⁽ⁿ⁾(y_n)]
               = Σ_{n=1}^N Σ_y Σ_x p(y|x) q⁽ⁿ⁾(x) log [p(y|x)/p⁽ⁿ⁾(y)]
               = Σ_{n=1}^N I(𝒳⁽ⁿ⁾; 𝒴⁽ⁿ⁾)   (1.2.18)

with equality if and only if p̃_N(y) = p_N(y) for all y ∈ 𝒴_N. Equality is thus
achieved when the output random variables y₁, y₂, ..., y_N are independent.
Since the channel is memoryless, this certainly happens if the input random
variables x₁, x₂, ..., x_N are independent. The upper inequality follows trivially
since I(𝒳⁽ⁿ⁾; 𝒴⁽ⁿ⁾) ≤ C with equality, according to (1.2.8), when and only when
the input probability distribution q⁽ⁿ⁾(·) achieves the maximum average
mutual information.
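A small numerical illustration of Lemma 1.2.2 (ours, not from the text): two uses of a BSC with correlated inputs give I(𝒳_N; 𝒴_N) strictly below NC, while independent capacity-achieving inputs would meet the upper bound. numpy is assumed available.

    import numpy as np
    from itertools import product

    def mutual_information(joint):
        # I between the row index and the column index of a joint distribution, in bits.
        pa, pb = joint.sum(axis=1), joint.sum(axis=0)
        return sum(joint[a, b] * np.log2(joint[a, b] / (pa[a] * pb[b]))
                   for a in range(joint.shape[0]) for b in range(joint.shape[1])
                   if joint[a, b] > 0)

    p = 0.1
    bsc = np.array([[1 - p, p], [p, 1 - p]])
    C = 1 + p * np.log2(p) + (1 - p) * np.log2(1 - p)        # BSC capacity, 1 - H(p)

    # Correlated input pair (x1, x2): equal symbols with probability 0.9, uniform marginals.
    q2 = {(0, 0): 0.45, (1, 1): 0.45, (0, 1): 0.05, (1, 0): 0.05}

    # Joint distribution of the input pair and the output pair over the memoryless BSC.
    joint = np.zeros((4, 4))
    for (x1, x2), qx in q2.items():
        for y1, y2 in product(range(2), repeat=2):
            joint[2 * x1 + x2, 2 * y1 + y2] = qx * bsc[x1, y1] * bsc[x2, y2]

    print(mutual_information(joint), "<=", 2 * C)             # I(X_N;Y_N) <= NC for N = 2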
Figure 1.6 Cascade of channels: x → DMC 1 → y → DMC 2 → v.
Lemma 1.2.3 Consider three random variables x, y, v which have joint prob-
ability p(x, y, v) for x ∈ 𝒳, y ∈ 𝒴, v ∈ 𝒱. Let average mutual information be
defined for each pair of random variables, I(𝒳; 𝒴), I(𝒳; 𝒱), and I(𝒴; 𝒱).
Assume, as shown in Fig. 1.6, that x is an input random variable to one
channel with output y which in turn becomes the input to a second channel
with output random variable v. Assume further that, as implied by Fig. 1.6,

    p(v|x, y) = p(v|y)   (1.2.19)

which means that x influences v only through y. Then I(𝒳; 𝒴), I(𝒳; 𝒱), and
I(𝒴; 𝒱) are related by the inequalities

    I(𝒳; 𝒱) ≤ I(𝒳; 𝒴)   (1.2.20)

and

    I(𝒳; 𝒱) ≤ I(𝒴; 𝒱)   (1.2.21)
PROOF

    I(𝒳; 𝒱) − I(𝒳; 𝒴) = Σ_x Σ_y Σ_v p(x, y, v) log [p(v|x)/p(v)] − Σ_x Σ_y Σ_v p(x, y, v) log [p(y|x)/p(y)]
                      = Σ_x Σ_y Σ_v p(x, y, v) log [p(v|x) p(y) / (p(v) p(y|x))]
                      = (ln 2)⁻¹ Σ_x Σ_y Σ_v p(x, y, v) ln [p(v|x) p(y) / (p(v) p(y|x))]
                      ≤ (ln 2)⁻¹ Σ_x Σ_y Σ_v p(x, y, v) [p(v|x) p(y) / (p(v) p(y|x)) − 1]   (1.2.22)

where we have again used ln x ≤ x − 1. Note further that by Bayes' rule

    p(x, y, v) p(v|x) p(y) / [p(v) p(y|x)] = p(x, y, v) p(x, v) p(y) / [p(v) p(x, y)]
                                          = p(v|x, y) p(x, v) p(y) / p(v)
                                          = p(v|x, y) p(x|v) p(y)
                                          = p(v|y) p(x|v) p(y)   (1.2.23)
Figure 1.7 Data processing system: u → Encoder → x → DMC → y → Decoder → v, with u ∈ 𝒰_L, x ∈ 𝒳_N, y ∈ 𝒴_N, v ∈ 𝒱_L.
where in the last equality we used the hypothesis (1.2.19). Hence combining
(1.2.22) and (1.2.23)

    I(𝒳; 𝒱) − I(𝒳; 𝒴) ≤ (ln 2)⁻¹ [Σ_x Σ_y Σ_v p(v|y) p(x|v) p(y) − 1]
                      = (ln 2)⁻¹ [1 − 1]
                      = 0   (1.2.24)

The second inequality follows from a similar argument.
Lemma 1.2.3 can be generalized easily to various length sequences in a
cascade of devices. A special case of the second DMC in Fig. 1.6 is a deterministic
device that maps input y into output v deterministically. Next consider Fig. 1.7
where we assume that u is a sequence of length L of random variables with
probability p_L(u) for u ∈ 𝒰_L, which generates the inputs to a deterministic device
called an encoder whose output sequence x is of length N. The sequence x is then
the input to the DMC for which, by definition,

    p_N(y|x) = ∏_{n=1}^N p(y_n|x_n)   for any x ∈ 𝒳_N and y ∈ 𝒴_N
Finally y is the input to a deterministic device called a decoder whose output is v, a
sequence of length L. The encoder can be assumed to operate on the entire L
length sequence u to generate the N length output sequence x. Similarly the
decoder can be assumed to operate on the entire N length sequence y to output
the L length sequence v. Regarding sequences as single inputs and outputs we have
from Lemma 1.2.3 the inequalities

    I(𝒰_L; 𝒱_L) ≤ I(𝒰_L; 𝒴_N)   (1.2.25)

and

    I(𝒰_L; 𝒴_N) ≤ I(𝒳_N; 𝒴_N)   (1.2.26)

Combining these we obtain the data-processing theorem:

Theorem 1.2.1: Data-processing theorem For the system of Fig. 1.7

    I(𝒰_L; 𝒱_L) ≤ I(𝒳_N; 𝒴_N)   (1.2.27)
This result assumes that each sequence influences subsequent sequences
as shown in Fig. 1.7. That is u influences v only through x, which in turn
influences v only through y so that p_L(v|u, x, y) = p_L(v|y) where y ∈ 𝒴_N and
v ∈ 𝒱_L. Also, from Lemma 1.2.2, we obtain the result that for the system of
Fig. 1.7

    I(𝒰_L; 𝒱_L) ≤ NC   (1.2.28)
where C is the channel capacity of the DMC.
The above properties of average mutual information follow easily from simple
inequalities and definitions. Even though mutual information can be negative-
valued, the average mutual information cannot be negative. Furthermore, the
average mutual information between outputs and inputs of a DMC is nonnegative
and becomes zero only when the outputs are independent of the inputs. Thus it is
not surprising to find that by cascading more devices between inputs and outputs
the average mutual information decreases, for the insertion of each additional
device weakens the dependence between input and output. Other properties of
average mutual information are given in App. 1A. Although these properties of
average mutual information are discussed in terms of “channels” they apply to
more general situations. For example, the “data-processing theorem” applies
even when the encoder, channel, and decoder in Fig. 1.7 are replaced by arbitrary
“data processors.”
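The data-processing theorem can be illustrated numerically (an illustration of ours, not from the text): cascading a second BSC after a first one can only reduce the average mutual information, since the cascade x → y → v is itself a channel whose transition matrix is the product of the two. numpy is assumed available.

    import numpy as np

    def mutual_information(q, P):
        # I(X;Y) in bits for input distribution q and channel matrix P[x][y].
        q, P = np.asarray(q, float), np.asarray(P, float)
        p_out = q @ P
        return sum(q[x] * P[x, y] * np.log2(P[x, y] / p_out[y])
                   for x in range(len(q)) for y in range(P.shape[1])
                   if q[x] > 0 and P[x, y] > 0)

    def bsc(p):
        return np.array([[1 - p, p], [p, 1 - p]])

    q = [0.5, 0.5]
    P1, P2 = bsc(0.05), bsc(0.10)
    print(mutual_information(q, P1))        # I(X;Y) over the first channel alone
    print(mutual_information(q, P1 @ P2))   # I(X;V) over the cascade: never larger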
To show the significance of the definition of mutual information, average
mutual information, and channel capacity, we examine the problem of sending the
outputs of a source over a communication channel. We shall show that if the
entropy of the source is greater than the capacity of the channel, then the com-
munication system cannot operate with arbitrarily small error no matter how
complex the coding system. This negative result is called the converse to the
coding theorem.
1.3 THE CONVERSE TO THE CODING THEOREM
Let us now examine the problem of sending the outputs of a discrete memoryless
source (DMS) to a destination through a communication channel modeled as a
discrete memoryless channel (DMC). Specifically, consider the block diagram of
Fig. 1.8 where the DMS alphabet is W@ = {ay, az, ..., a4}, with probability distri-
bution P(-) and entropy H(W). We assume that source outputs occur once every T;
seconds so that the DMS average information output rate is H(W)/T, bits per
second, when H(%) is measured in bits per output. The destination accepts letters
belonging to the same alphabet, YW = %, at the same source rate of one symbol
every 7, seconds.
The DMC has input alphabet 2, output alphabet Y, and conditional probabil-
ities p(y|x) for ye Y, x € &. It also has a channel capacity of C bits per channel
use, when mutual information is measured in bits. We assume that the channel is
used once every T, seconds.
Figure 1.8 A communication system: DMS → u = (u₁, u₂, ..., u_L) → Encoder → x = (x₁, x₂, ..., x_N) → DMC → y = (y₁, y₂, ..., y_N) → Decoder → v = (v₁, v₂, ..., v_L) → Destination.
We are now dealing with a DMS that outputs a symbol once every T_s seconds
and a DMC that can be used once every T_c seconds. Without compromising
notation, we can continue to label source outputs and channel inputs with integer
indices. We merely adopt the convention that the source output u_l occurs at time
lT_s and x_n is the channel input at time nT_c + T_d, where T_d is the encoding delay.
We assume that the DMS and DMC are given and are not under our control.
The encoder and decoder, on the other hand, can be designed in any way we
please. In particular, the encoder takes source symbols and outputs channel input
symbols while the decoder takes channel output symbols and outputs symbols
belonging to the source alphabet 𝒱 = 𝒰. Suppose now we wish to send to the
destination L source output symbols, u. The encoder then sends N channel input
symbols, x, over the channel where we assume that

    LT_s = NT_c   (1.3.1)

Each channel input symbol can depend on the L source symbols, u, in any way
desired. Similarly the decoder takes the N channel output symbols, y, and outputs
a sequence of L destination symbols, v. Again each destination symbol can depend
on the N channel output symbols, y, in any way desired. The channel is mem-
oryless so that for each time nT_c + T_d the channel output symbol y_n depends only
on the corresponding channel input symbol x_n.
In any communication system of this type we would like to achieve very small
error probabilities. In particular, we are interested in the probability of error for
each source letter, as defined by
    P_{e,l} = Pr {v_l ≠ u_l}
           = Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v)   (1.3.2)

for l = 1, 2, ..., L. Here P⁽ˡ⁾(u, v) is the joint probability distribution of u_l and v_l.
P_{e,l} is the probability that the lth source output u_l is decoded incorrectly by the
destination. The average per digit error probability, ⟨P_e⟩, over the L source
outputs is defined as

    ⟨P_e⟩ = (1/L) Σ_{l=1}^L P_{e,l}   (1.3.3)
For most digital communication systems ⟨P_e⟩ is the appropriate performance
criterion for evaluating the system. If ⟨P_e⟩ can be made arbitrarily small we have
a reliable communication system. We proceed to show that if the source entropy is
greater than channel capacity, reliable communication is impossible. This result is
known as the converse to the coding theorem.
We begin by considering the difference between the entropy of the source
sequence, H(𝒰_L), and the average mutual information between the source se-
quence and the destination sequence, I(𝒰_L; 𝒱_L). From the definitions and Bayes'
rule, it follows that

    H(𝒰_L) − I(𝒰_L; 𝒱_L) = Σ_v P_L(v) Σ_u P_L(u|v) log [1/P_L(u|v)]   (1.3.4)

Next we apply the inequality (1.1.8) to get the bound

    Σ_u P_L(u|v) log [1/P_L(u|v)] ≤ Σ_u P_L(u|v) log [1/P̃_L(u|v)]   (1.3.5)

for any conditional probability P̃_L(u|v). Let us now choose

    P̃_L(u|v) = ∏_{l=1}^L P⁽ˡ⁾(u_l|v_l)   (1.3.6)

where

    P⁽ˡ⁾(u_l|v_l) = P⁽ˡ⁾(u_l, v_l)/P⁽ˡ⁾(v_l)   (1.3.7)

and

    P⁽ˡ⁾(v_l) = Σ_{u_l} P⁽ˡ⁾(u_l, v_l)   (1.3.8)
This choice in (1.3.4) and (1.3.5) yields the bound

    H(𝒰_L) − I(𝒰_L; 𝒱_L) ≤ Σ_{l=1}^L [ Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v) log [1/P⁽ˡ⁾(u|v)]
                            + Σ_v P⁽ˡ⁾(v, v) log [1/P⁽ˡ⁾(v|v)] ]   (1.3.9)
We now consider the two parts in this bound separately, using again the fun-
damental inequality ln x ≤ x − 1 and the relationship

    P_{e,l} = Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v)   (1.3.2)

from which it readily follows that

    1 − P_{e,l} = Σ_u Σ_{v=u} P⁽ˡ⁾(u, v) = Σ_v P⁽ˡ⁾(v, v)   (1.3.10)
We bound the first term in the brace in (1.3.9) as follows, using (1.3.2):

    Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v) log [1/P⁽ˡ⁾(u|v)]
      = Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v) log [P_{e,l} / ((A − 1) P⁽ˡ⁾(u|v))]
        + Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v) log [(A − 1)/P_{e,l}]
      ≤ (ln 2)⁻¹ Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v) [P_{e,l} / ((A − 1) P⁽ˡ⁾(u|v)) − 1]
        + P_{e,l} log [(A − 1)/P_{e,l}]
      = (ln 2)⁻¹ [ (P_{e,l}/(A − 1)) Σ_u Σ_{v≠u} P⁽ˡ⁾(v) − Σ_u Σ_{v≠u} P⁽ˡ⁾(u, v) ]
        + P_{e,l} log [(A − 1)/P_{e,l}]
      = (ln 2)⁻¹ [P_{e,l} − P_{e,l}] + P_{e,l} log [(A − 1)/P_{e,l}]
      = P_{e,l} log [(A − 1)/P_{e,l}]   (1.3.11)
The second term is bounded in a similar manner as follows, using (1.3.10):

    Σ_v P⁽ˡ⁾(v, v) log [1/P⁽ˡ⁾(v|v)]
      = Σ_v P⁽ˡ⁾(v, v) log [(1 − P_{e,l})/P⁽ˡ⁾(v|v)] + Σ_v P⁽ˡ⁾(v, v) log [1/(1 − P_{e,l})]
      ≤ (ln 2)⁻¹ Σ_v P⁽ˡ⁾(v, v) [(1 − P_{e,l})/P⁽ˡ⁾(v|v) − 1] + (1 − P_{e,l}) log [1/(1 − P_{e,l})]
      = (ln 2)⁻¹ [ (1 − P_{e,l}) Σ_v P⁽ˡ⁾(v) − Σ_v P⁽ˡ⁾(v, v) ] + (1 − P_{e,l}) log [1/(1 − P_{e,l})]
      = (ln 2)⁻¹ [(1 − P_{e,l}) − (1 − P_{e,l})] + (1 − P_{e,l}) log [1/(1 − P_{e,l})]
      = (1 − P_{e,l}) log [1/(1 − P_{e,l})]   (1.3.12)
Recalling from Sec. 1.1 the definition of the binary entropy function

    ℋ(p) = p log (1/p) + (1 − p) log [1/(1 − p)]   (1.3.13)

and the definition (1.3.3) of ⟨P_e⟩, and using the bounds (1.3.11) and (1.3.12) in
(1.3.9), we obtain

    H(𝒰_L) − I(𝒰_L; 𝒱_L) ≤ Σ_{l=1}^L [ P_{e,l} log (A − 1) + P_{e,l} log (1/P_{e,l}) + (1 − P_{e,l}) log (1/(1 − P_{e,l})) ]
                         = L⟨P_e⟩ log (A − 1) + Σ_{l=1}^L ℋ(P_{e,l})   (1.3.14)
The next-to-final form of the desired inequality follows from the observation that
from (1.1.8) we have

    P_{e,l} log (1/P_{e,l}) + (1 − P_{e,l}) log [1/(1 − P_{e,l})]
      ≤ P_{e,l} log (1/⟨P_e⟩) + (1 − P_{e,l}) log [1/(1 − ⟨P_e⟩)]   (1.3.15)

so that

    Σ_{l=1}^L ℋ(P_{e,l}) ≤ Σ_{l=1}^L [ P_{e,l} log (1/⟨P_e⟩) + (1 − P_{e,l}) log [1/(1 − ⟨P_e⟩)] ]
                        = Lℋ(⟨P_e⟩)   (1.3.16)
Hence

    H(𝒰_L) − I(𝒰_L; 𝒱_L) ≤ L⟨P_e⟩ log (A − 1) + Lℋ(⟨P_e⟩)   (1.3.17)

Since the source is memoryless, from (1.1.12) we have

    H(𝒰_L) = LH(𝒰)   (1.3.18)

Furthermore, Theorem 1.2.1, Lemma 1.2.2, and (1.3.1) give us

    I(𝒰_L; 𝒱_L) ≤ I(𝒳_N; 𝒴_N) ≤ NC = (T_s/T_c) LC   (1.3.19)

Using (1.3.18) and (1.3.19) in (1.3.17) yields the desired bound

    H(𝒰) − (T_s/T_c) C ≤ [H(𝒰_L) − I(𝒰_L; 𝒱_L)]/L
                      ≤ ⟨P_e⟩ log (A − 1) + ℋ(⟨P_e⟩)   (1.3.20)

For convenience in using the upper bound of (1.3.20), we define

    F(⟨P_e⟩) = ⟨P_e⟩ log (A − 1) + ℋ(⟨P_e⟩)
Figure 1.9 F(⟨P_e⟩) = ⟨P_e⟩ log (A − 1) + ℋ(⟨P_e⟩) as a function of ⟨P_e⟩, rising from 0 to log A at ⟨P_e⟩ = (A − 1)/A; the level β determines a minimum value α.
According to (1.3.20), if the source entropy of H(𝒰)/T_s bits per second is
greater than the channel capacity of C/T_c bits per second, then F(⟨P_e⟩) =
⟨P_e⟩ log (A − 1) + ℋ(⟨P_e⟩) is greater than the constant β = H(𝒰) −
(T_s/T_c)C > 0. Figure 1.9 shows F(⟨P_e⟩) as a function of ⟨P_e⟩. From this it is clear
that if β > 0 then there exists some α > 0 such that ⟨P_e⟩ ≥ α. Note that this holds
regardless of the source sequence length L, and hence yields the following form of
the converse theorem due to Fano [1952].

Theorem 1.3.1 (Converse to the Coding Theorem) If the entropy per second,
H(𝒰)/T_s, of the source is greater than the channel capacity per second, C/T_c,
then there exists a constant α > 0 such that ⟨P_e⟩ ≥ α for all sequence lengths.
The converse to the coding theorem shows that it is impossible for a commun-
ication system to operate with arbitrarily small average error probability when the
information rate of the source is greater than channel capacity. We shall see in
subsequent chapters that if the information rate is less than channel capacity, then
there are ways to achieve arbitrarily small average error probability. These results
give the concepts of mutual information and particularly channel capacity their
operational significance.
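The converse can also be used quantitatively: given H(𝒰), T_s, T_c, C, and A, the smallest α compatible with (1.3.20) is the point where F(⟨P_e⟩) first reaches β (Fig. 1.9). The following sketch (ours; the source and channel numbers are hypothetical, and numpy is assumed available) finds that α by bisection.

    import numpy as np

    def F(pe, A):
        # F(<Pe>) = <Pe> log(A-1) + H(<Pe>), in bits.
        H = 0.0 if pe in (0.0, 1.0) else -pe * np.log2(pe) - (1 - pe) * np.log2(1 - pe)
        return pe * np.log2(A - 1) + H

    def min_error_probability(H_source, Ts, Tc, C, A, tol=1e-10):
        # Smallest alpha with F(alpha) = beta = H(U) - (Ts/Tc) C, found by bisection
        # on [0, (A-1)/A], the interval on which F is nondecreasing.
        beta = H_source - (Ts / Tc) * C
        if beta <= 0:
            return 0.0
        lo, hi = 0.0, (A - 1) / A
        while hi - lo > tol:
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if F(mid, A) < beta else (lo, mid)
        return hi

    # Hypothetical numbers: H(U) = 1 bit, one source symbol and one channel use per
    # second, C = 0.5 bit; every coding scheme then has <Pe> of at least about 0.11.
    print(min_error_probability(H_source=1.0, Ts=1.0, Tc=1.0, C=0.5, A=2))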
1.4 SUMMARY AND BIBLIOGRAPHICAL NOTES
In this introductory chapter we have presented the basic concepts of information
and its more general form, mutual information. We have shown that for a discrete
memoryless source the average amount of information per source output, called
entropy, represents the theoretical limit on the minimum average number of
binary symbols per source output necessary to represent source-output sequences.
This result generalizes to discrete stationary ergodic sources (see Prob. 1.3) and
more general code alphabets. Next we defined discrete memoryless channels
which serve as models for many real noisy communication channels. The maxi-
mum average mutual information of a discrete memoryless channel, called chan-
nel capacity, represents the theoretical limit on the rate of information that can be
reliably transmitted over the channel. In this introductory chapter we have proved
for discrete memoryless channels the negative part of this result, commonly called
the converse to the coding theorem. This result generalizes easily to all memoryless
channels.
The theoretical foundations of digital communication were laid by C. E.
Shannon [1948]. Most of the concepts of this chapter are found in greater gener-
ality in this original work. Other similar treatments can be found in Fano [1961],
Abramson [1963], Gallager [1968], and Jelinek [1968a]. The books of Feinstein
[1958], Wolfowitz [1961], and Ash [1965] may appeal to those who prefer math-
ematics to engineering applications.
APPENDIX 1A CONVEX FUNCTIONS
In this chapter we defined two fundamental parameters of information theory:
H(𝒰), the entropy of an information source, and I(𝒳; 𝒴), the average mutual
information between the inputs and outputs of a communication channel. These
are two examples of a more general class of functions which have the property
known as convexity. In this section we briefly examine convex functions and some
of their properties. These results will be useful throughout the rest of this book.
Definition A real-valued function f(·) of a real number is defined to be
convex ∩ over an interval 𝓘 if, for all x₁ ∈ 𝓘, x₂ ∈ 𝓘, and θ, 0 < θ < 1, the
function satisfies

    θf(x₁) + (1 − θ)f(x₂) ≤ f[θx₁ + (1 − θ)x₂]   (1A.1)

If the inequality in (1A.1) is reversed for all such x₁, x₂, and θ then f(·) is called
convex ∪. When (1A.1) or its converse is a strict inequality whenever x₁ ≠ x₂
then we call f(·) strictly convex ∩ or strictly convex ∪.

In Fig. 1A.1 we sketch a typical convex ∩ function for fixed x₁ and x₂ as
a function of θ. From this it is clear why the ∩ (cap) notation is used here.¹⁰
Similar comments apply to convex ∪ (cup) functions. In fact since a convex ∪
function is the negative of a convex ∩ function, we need only examine the prop-
erties of convex ∩ functions. Commonly encountered convex ∩ functions are
ln x and x^p (0 < p < 1) for the interval 𝓘 = (0, ∞). Convex ∪ functions include
¹⁰ In the mathematical literature a convex ∩ function is called concave and a convex ∪ function
convex. Gallager [1968] introduced the notation used here to avoid the usual confusion associated with
the names concave and convex.
Figure 1A.1 A convex ∩ function: the chord value θf(x₁) + (1 − θ)f(x₂) lies below f[θx₁ + (1 − θ)x₂].
−ln x and x^p (p > 1) for 𝓘 = (0, ∞). Functions that are both convex ∪ and
convex ∩ are linear functions of the form ax + b.
Sometimes with more complex functions it is difficult to tell whether or not a
function is convex ∩. A useful test is given next.
Lemma 1A.1 Suppose f(·) is a real-valued function with derivatives f′(·) and
f″(·) defined on an interval 𝓘. Then f(·) is a convex ∩ function over interval
𝓘 if and only if

    f″(x) ≤ 0   for all x ∈ 𝓘   (1A.2)

PROOF Let x₁, x₂, and y be any set of points in 𝓘. Integrating f″(·) twice,
we have

    ∫_y^{x₁} ∫_y^β f″(α) dα dβ = ∫_y^{x₁} [f′(β) − f′(y)] dβ
                              = f(x₁) − f(y) − f′(y)(x₁ − y)   (1A.3)

and

    ∫_y^{x₂} ∫_y^β f″(α) dα dβ = f(x₂) − f(y) − f′(y)(x₂ − y)   (1A.4)

For any θ ∈ (0, 1) we combine these equations to obtain

    θf(x₁) + (1 − θ)f(x₂) − f(y) − f′(y)[θx₁ + (1 − θ)x₂ − y]
      = θ ∫_y^{x₁} ∫_y^β f″(α) dα dβ + (1 − θ) ∫_y^{x₂} ∫_y^β f″(α) dα dβ   (1A.5)

Now choosing y = θx₁ + (1 − θ)x₂ we see from (1A.5) that

    θf(x₁) + (1 − θ)f(x₂) ≤ f[θx₁ + (1 − θ)x₂]

for all x₁ and x₂ in 𝓘 and θ ∈ (0, 1) if and only if (1A.2) is true.
We proceed to define convex functions of several variables, but first we need
to define a convex region in a real vector space. Let ℛ_N be the set of N-dimensional
real vectors. We define a region 𝓛_N ⊂ ℛ_N to be a convex region if for each vector
x₁ ∈ 𝓛_N and each vector x₂ ∈ 𝓛_N, the vector θx₁ + (1 − θ)x₂ is in 𝓛_N for all
θ ∈ (0, 1). This means that for a convex region all points on the line connecting any two
points in the region also belong to the region. The convex region most often
encountered in this book is 𝒫_N, the set of probability vectors. Formally,

    𝒫_N = { x: Σ_{n=1}^N x_n = 1, x_n ≥ 0, n = 1, 2, ..., N }   (1A.6)
Definition A real-valued function f(·) of vectors of dimension N is defined to
be convex ∩ over a convex region 𝓛_N if, for all x₁ ∈ 𝓛_N, x₂ ∈ 𝓛_N, and θ,
0 < θ < 1, the function satisfies

    θf(x₁) + (1 − θ)f(x₂) ≤ f[θx₁ + (1 − θ)x₂]   (1A.7)

If we have a strict inequality whenever x₁ ≠ x₂ then f(·) is called strictly
convex ∩. The function is convex ∪ if the inequality is reversed.
For convex ∩ functions of vectors we have two important properties:

1. If f₁(x), f₂(x), ..., f_L(x) are convex ∩ functions and if c₁, c₂, ..., c_L are positive
   numbers, then

       Σ_{l=1}^L c_l f_l(x)   (1A.8)

   is convex ∩, with strict convexity if any of the {f_l(x)} are strictly convex ∩. This
   follows immediately from the definition given in (1A.7).
2. Let x be a random vector of dimension N and let f(x) be any convex ∩ function
   of vectors of dimension N. Then

       E[f(x)] ≤ f(E[x])   (1A.9)

   where E[·] is the expectation. This very useful inequality, known as the
   Jensen inequality, is proved in App. 1B.
The entropy function

    H(x) = Σ_{n=1}^N x_n ln (1/x_n)   (1A.10)

is a convex ∩ function over 𝒫_N defined by (1A.6). To see this let

    f_n(x) = x_n ln (1/x_n)   for n = 1, 2, ..., N

By using Lemma 1A.1 we see that each f_n(x) is convex ∩. Then by property 1 we
have that H(x) = Σ_{n=1}^N f_n(x) is also convex ∩. Another proof can be obtained
directly by using inequality (1.1.8). (See Prob. 1.12.)
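As a quick numerical check of this convexity claim (ours, not part of the appendix), one can verify (1A.7) for H(x) directly on a few probability vectors; numpy is assumed available.

    import numpy as np

    def H(x):
        # Entropy of a probability vector, natural-log form of (1A.10).
        x = np.asarray(x, float)
        return float(np.sum(x[x > 0] * np.log(1.0 / x[x > 0])))

    x1 = np.array([0.7, 0.2, 0.1])
    x2 = np.array([0.1, 0.3, 0.6])
    for theta in (0.25, 0.5, 0.9):
        chord = theta * H(x1) + (1 - theta) * H(x2)
        print(chord <= H(theta * x1 + (1 - theta) * x2))   # True: H is convex cap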
Finally suppose we consider a DMC with input alphabet 𝒳, output alphabet
𝒴, and transition probabilities p(y|x) for x ∈ 𝒳, y ∈ 𝒴. For an input probability
distribution q(x) for x ∈ 𝒳, we defined average mutual information as

    I(𝒳; 𝒴) = Σ_x Σ_y q(x) p(y|x) log [p(y|x)/p(y)]   (1.2.7)

To emphasize the dependence of I(𝒳; 𝒴) on the transition probabilities repre-
sented by P and the input probability distribution represented by q we write this
as

    I(𝒳; 𝒴) = I(q; P)   (1A.11)
Lemma 1A.2 I(𝒳; 𝒴) for fixed channel transition probabilities is a convex ∩
function over the input probability space, and for fixed input probability
distribution a convex ∪ function over the channel transition probability
space. That is

    θI(q⁽¹⁾; P) + (1 − θ)I(q⁽²⁾; P) ≤ I(θq⁽¹⁾ + (1 − θ)q⁽²⁾; P)   (1A.12)

where θq⁽¹⁾ + (1 − θ)q⁽²⁾ represents the probability distribution θq⁽¹⁾(x) +
(1 − θ)q⁽²⁾(x), x ∈ 𝒳, for any input probability distributions q⁽¹⁾ and q⁽²⁾ and
for all θ ∈ (0, 1); and

    θI(q; P⁽¹⁾) + (1 − θ)I(q; P⁽²⁾) ≥ I(q; θP⁽¹⁾ + (1 − θ)P⁽²⁾)   (1A.13)

where θP⁽¹⁾ + (1 − θ)P⁽²⁾ represents the transition probabilities θp⁽¹⁾(y|x) +
(1 − θ)p⁽²⁾(y|x) for y ∈ 𝒴, x ∈ 𝒳, for any transition probabilities P⁽¹⁾ and P⁽²⁾
and for all θ ∈ (0, 1). P in (1A.12) represents any transition probabilities and q
in (1A.13) represents any input probability distribution.
PROOF For any given P and q let us denote by p the output distribution

    p(y) = Σ_x p(y|x) q(x)   y ∈ 𝒴   (1A.14)

For fixed P it should be clear that when input distributions q⁽¹⁾ and q⁽²⁾ result
in output distributions p⁽¹⁾ and p⁽²⁾ respectively, then the input distribution
θq⁽¹⁾ + (1 − θ)q⁽²⁾ results in the output distribution θp⁽¹⁾ + (1 − θ)p⁽²⁾. Now
note that

    I(q; P) = Σ_x q(x) Σ_y p(y|x) log p(y|x) + H(p)   (1A.15)

where H(p) is the entropy of the output alphabet. The first term in (1A.15) is
linear in q and therefore convex ∩ in q. The second term is convex ∩ in p, as
established by the argument following (1A.10). But since p is linear in q this
means that it is also convex ∩ in q. By property 1 we see that I(q; P) is convex
∩ in q for fixed P. This proves (1A.12).
To prove (1A.13), let

    p(y|x) = θp⁽¹⁾(y|x) + (1 − θ)p⁽²⁾(y|x)   y ∈ 𝒴 and x ∈ 𝒳

and

    p(y) = Σ_x p(y|x) q(x)   y ∈ 𝒴

Then

    (ln 2) I(q; θP⁽¹⁾ + (1 − θ)P⁽²⁾) = Σ_x Σ_y q(x) p(y|x) ln [p(y|x)/p(y)]
      = θ Σ_x Σ_y q(x) p⁽¹⁾(y|x) ln [p(y|x)/p(y)]
        + (1 − θ) Σ_x Σ_y q(x) p⁽²⁾(y|x) ln [p(y|x)/p(y)]   (1A.16)

Next using the inequality ln x ≤ x − 1 we have

    Σ_x Σ_y q(x) p⁽¹⁾(y|x) ln [p(y|x)/p(y)]
      = Σ_x Σ_y q(x) p⁽¹⁾(y|x) { ln [p⁽¹⁾(y|x)/p⁽¹⁾(y)] + ln [p(y|x) p⁽¹⁾(y) / (p(y) p⁽¹⁾(y|x))] }
      ≤ Σ_x Σ_y q(x) p⁽¹⁾(y|x) { ln [p⁽¹⁾(y|x)/p⁽¹⁾(y)] + [p(y|x) p⁽¹⁾(y) / (p(y) p⁽¹⁾(y|x)) − 1] }
      = (ln 2) I(q; P⁽¹⁾) + Σ_x Σ_y q(x) p⁽¹⁾(y|x) [p(y|x) p⁽¹⁾(y) / (p(y) p⁽¹⁾(y|x)) − 1]
      = (ln 2) I(q; P⁽¹⁾)   (1A.17)

since the second term sums to zero. Similarly

    Σ_x Σ_y q(x) p⁽²⁾(y|x) ln [p(y|x)/p(y)] ≤ (ln 2) I(q; P⁽²⁾)   (1A.18)

Using (1A.17) and (1A.18) in (1A.16) we have the desired result (1A.13).
We have shown here that the fundamental parameters, entropy and average
mutual information, have certain convexity properties. In subsequent chapters we
shall encounter other important parameters of information theory that also have
convexity properties.
APPENDIX 1B JENSEN INEQUALITY FOR
CONVEX FUNCTIONS
Lemma Let f(·) be a convex ∩ real-valued function defined on the real line.
Let x be a random variable with finite expectation. Then

    E[f(x)] ≤ f(E[x])

For convex ∪ functions, the inequality is reversed.
PROOF We first prove this for a discrete finite sample space. The definition of
a convex ∩ function is most concisely stated as the property that any line
segment connecting two points (x₁, f(x₁)) and (x₂, f(x₂)) must lie below the
function over the interval x₁ ≤ x ≤ x₂ (see Fig. 1B.1). Consider first the
distribution p₁, p₂ = 1 − p₁ for a binary-valued random variable. Then it
follows from the definition of the line that the point (p₁x₁ + p₂x₂, p₁f(x₁) +
p₂f(x₂)) lies on the line and hence must lie directly below the point
(p₁x₁ + p₂x₂, f(p₁x₁ + p₂x₂)) on the function. It follows that

    p₁f(x₁) + p₂f(x₂) ≤ f(p₁x₁ + p₂x₂)   (1B.1)

Now extending to a three-point distribution p₁, p₂, p₃ where
p₁ + p₂ + p₃ = 1,

    p₁f(x₁) + p₂f(x₂) + p₃f(x₃)
      = (p₁ + p₂)[ (p₁/(p₁ + p₂)) f(x₁) + (p₂/(p₁ + p₂)) f(x₂) ] + p₃f(x₃)
      ≤ (p₁ + p₂) f[ (p₁/(p₁ + p₂)) x₁ + (p₂/(p₁ + p₂)) x₂ ] + p₃f(x₃)   (1B.2)

where we have used (1B.1), recognizing that the coefficients p₁/(p₁ + p₂) and
p₂/(p₁ + p₂) constitute a binary distribution defined at the points x₁ and x₂.
Figure 1B.1 Convex ∩ function: the chord point p₁f(x₁) + p₂f(x₂) lies below f(p₁x₁ + p₂x₂).
Again using (1B.1) on the binary distribution (p₁ + p₂) and p₃ defined at
the points ξ = (p₁x₁ + p₂x₂)/(p₁ + p₂) and x₃, we have

    (p₁ + p₂)f(ξ) + p₃f(x₃) ≤ f[(p₁ + p₂)ξ + p₃x₃]   (1B.3)

Substituting for ξ and combining (1B.2) and (1B.3), we obtain

    p₁f(x₁) + p₂f(x₂) + p₃f(x₃) ≤ f(p₁x₁ + p₂x₂ + p₃x₃)   (1B.4)

We proceed to extend by induction to a finite distribution of order n.
Suppose that for a distribution of order n − 1

    Σ_{i=1}^{n−1} p_i f(x_i) ≤ f( Σ_{i=1}^{n−1} p_i x_i )   (1B.5)

Then for order n

    Σ_{i=1}^n p_i f(x_i) = ( Σ_{i=1}^{n−1} p_i ) Σ_{j=1}^{n−1} [ p_j / Σ_{i=1}^{n−1} p_i ] f(x_j) + p_n f(x_n)
                        ≤ ( Σ_{i=1}^{n−1} p_i ) f(ξ) + p_n f(x_n)   (1B.6)

where

    ξ = Σ_{j=1}^{n−1} p_j x_j / Σ_{i=1}^{n−1} p_i

and where we used (1B.5) and the fact that p_j / Σ_{i=1}^{n−1} p_i, for j = 1, 2, ...,
(n − 1), constitutes an (n − 1)-point distribution. Now applying (1B.1) on the
binary distribution Σ_{i=1}^{n−1} p_i and p_n, it follows that

    ( Σ_{i=1}^{n−1} p_i ) f(ξ) + p_n f(x_n) ≤ f( Σ_{i=1}^{n−1} p_i ξ + p_n x_n )   (1B.7)

Finally, combining (1B.6) and (1B.7), we have

    Σ_{i=1}^n p_i f(x_i) ≤ f( Σ_{i=1}^n p_i x_i )   (1B.8)

or E[f(x)] ≤ f(E[x]), as was to be shown.

Extension to any infinite discrete sample space is direct, as is extension to any
distribution function P(·) for which the Stieltjes integral ∫ f(x) dP(x) exists. For
such cases (1B.8) becomes

    ∫ f(x) dP(x) ≤ f( ∫ x dP(x) )   (1B.9)
Inequalities (1B.8) and (1B.9) can be expressed generically as

    E[f(x)] ≤ f(E[x])

where f(x) is a convex ∩ function. If f(x) is convex ∪, it immediately follows that
all inequalities are reversed.
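A one-line numerical check of the Jensen inequality (ours, not part of the appendix): for the convex ∩ function f(x) = ln x and an arbitrary finite distribution, E[f(x)] never exceeds f(E[x]). numpy is assumed available.

    import numpy as np

    rng = np.random.default_rng(0)
    p = rng.dirichlet(np.ones(6))                 # an arbitrary 6-point distribution
    x = rng.uniform(0.5, 5.0, size=6)             # sample points of the random variable

    # ln is convex cap on (0, infinity), so E[ln x] <= ln E[x] by (1B.8).
    print(np.sum(p * np.log(x)) <= np.log(np.sum(p * x)))   # True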
PROBLEMS
1.1 Examples of entropy
(a) For a DMS with A = 2 output letters 𝒰 = {a₁, a₂} and P(a₁) = p, show by direct differentia-
tion that the entropy

    H(𝒰) = ℋ(p) = p log (1/p) + (1 − p) log [1/(1 − p)]

is maximized when p = 1/2.
(b) For the binary source in (a) consider sequences of two outputs as a single source output of an
extended source with alphabet 𝒰₂ = {(a₁, a₁), (a₁, a₂), (a₂, a₁), (a₂, a₂)}. Show directly that
H(𝒰₂) = 2H(𝒰).
(c) Consider the drawing of a card (with replacement) from a deck of 52 playing cards as a DMS.
What is the entropy of a randomly selected card? Suppose suits are ignored so that the output space is
now 𝒰 = {A, 2, 3, 4, 5, 6, 7, 8, 9, 10, J, Q, K}. What is the entropy of a randomly selected card now?
What if 𝒰 = {face card, not a face card}?
(d) What is the entropy of the output of the toss of a fair die? Suppose the die is biased so that the
probability of any face is proportional to the number of dots; now what is the entropy?
1.2 Given a sequence of discrete random variables u₁, u₂, ..., u_N with alphabets 𝒰⁽¹⁾, 𝒰⁽²⁾, ..., 𝒰⁽ᴺ⁾ and
a joint probability P_N(u) for u ∈ 𝒰⁽¹⁾ × 𝒰⁽²⁾ × ··· × 𝒰⁽ᴺ⁾. Its entropy is

    H(𝒰⁽¹⁾𝒰⁽²⁾ ··· 𝒰⁽ᴺ⁾) = Σ_u P_N(u) log [1/P_N(u)]

Show that

    H(𝒰⁽¹⁾𝒰⁽²⁾ ··· 𝒰⁽ᴺ⁾) ≤ Σ_{n=1}^N H(𝒰⁽ⁿ⁾)

with equality if and only if the random variables are independent. Here H(𝒰⁽ⁿ⁾) is the entropy of the nth
random variable.
1.3 For an arbitrary stationary ergodic source define entropy as

    H = lim_{N→∞} H(𝒰_N)/N

where

    H(𝒰_N) = Σ_u P_N(u) log [1/P_N(u)]

The asymptotic equipartition property of stationary ergodic sources gives

    lim_{N→∞} F_N = 0

for any ε > 0, where F_N = Pr { |(1/N) log [1/P_N(u)] − H| > ε }.
(a) Show that

    H(𝒰_n)/n ≤ H(𝒰_k)/k   for k ≤ n

(b) Prove the noiseless source coding theorem assuming that

    lim_{N→∞} F_N = 0
1.4 (Chebyshev Inequality and the Weak Law of Large Numbers)
(a) Show that for a random variable x with mean m and variance σ², Pr {|x − m| ≥ ε} ≤ σ²/ε²
for any ε > 0.
(b) Let z₁, z₂, ..., z_N be independent identically distributed random variables with mean z̄ and
variance σ². Show that for any ε > 0

    Pr { |(1/N) Σ_{n=1}^N z_n − z̄| ≥ ε } ≤ σ²/(Nε²)

and

    lim_{N→∞} Pr { |(1/N) Σ_{n=1}^N z_n − z̄| ≥ ε } = 0

Hint: Lower bound the variance of x by reducing the region of integration.
1.5 (Chernoff Bound) Show that F_N defined in (1.1.23) decreases at least exponentially with N by the
following steps:
(a) Define S(N, ε)₊ = {u: (1/N) log [1/P_N(u)] − H ≥ ε}.
(b) For z_n defined in (1.1.26), note that for u ∈ S(N, ε)₊

    Σ_{n=1}^N z_n − N(H + ε) ≥ 0

Hence for any s > 0 show that

    F_N⁺ = Σ_{u ∈ S(N, ε)₊} P_N(u)
        ≤ E [ 2^{s[Σ_{n=1}^N z_n − N(H + ε)]} ]
        = 2^{−NG(s)}

where

    G(s) = s[H + ε] − log E[2^{s z_n}]
         = s[H + ε] − log [ Σ_{k=1}^A P(a_k)^{1−s} ]

Hint: 1 ≤ 2^{s[Σ_{n=1}^N z_n − N(H + ε)]} for u ∈ S(N, ε)₊.
(c) By examining the first two derivatives of G(s) show that for some s* > 0 we have

    F_N⁺ ≤ 2^{−NG(s*)}

where G(s*) > 0.
(d) Do the same for

    S(N, ε)₋ = {u: (1/N) log [1/P_N(u)] − H ≤ −ε}

then combine with the result of (c) to get the desired bound.
1.6 Assume a DMS with alphabet 𝒰 = {a₁, a₂, ..., a_A} and probability P(u) for u ∈ 𝒰.
(a) For each u ∈ 𝒰 pick a binary codeword of length l(u) which satisfies

    log [1/P(u)] ≤ l(u) < log [1/P(u)] + 1

Show that the average length

    ⟨L⟩ = Σ_u P(u) l(u)

satisfies

    H(𝒰) ≤ ⟨L⟩ < H(𝒰) + 1

(b) Repeat (a) for source sequences of length N and obtain A^N binary codewords of lengths
{l(u): u ∈ 𝒰_N} with average length

    ⟨L_N⟩ = Σ_u P_N(u) l(u)

which satisfies

    NH(𝒰) ≤ ⟨L_N⟩ < NH(𝒰) + 1

(c) Show that the codewords in (a) and (b) can be chosen such that no codeword of length l is
identical to the first l bits of a codeword of length greater than or equal to l. That is, no codeword is a
prefix of another. Such a set of distinct codewords has the uniquely decodable property that no two
different codeword sequences can form the same binary sequence. Hence with these codes the source
outputs can be uniquely determined from the binary-code sequence.
1.7 Show that
(a) I(𝒳; 𝒴) ≤ H(𝒳) and I(𝒳; 𝒴) ≤ H(𝒴)
where I(𝒳; 𝒴) is defined by (1.2.7) and

    H(𝒳) = Σ_x q(x) log [1/q(x)]
    H(𝒴) = Σ_y p(y) log [1/p(y)]

(b) H(𝒳𝒴) = H(𝒴) + H(𝒳|𝒴)
where

    H(𝒳|𝒴) = Σ_x Σ_y p(x, y) log [1/q(x|y)]
1.8 Find the average mutual information between inputs and outputs of the following DMCs. Then
find their capacities.
(a) The binary erasure channel (BEC) of Fig. P1.8a
(b) The Z channel of Fig. P1.8b
Figure P1.8 (a) Binary erasure channel with inputs {0, 1}, outputs {0, E, 1}, and erasure probability p; (b) Z channel.
1.9 For the BEC given in Prob. 1.8(a) suppose that the encoder can observe the outputs of the
channel and constructs a variable length code as follows:
When the information symbol (assume a zero-memory binary-symmetric information source) is a
“0,” then the encoder keeps sending Os across the channel until an unerased output is achieved. If the
information symbol is a “1,” then the encoder keeps sending Is until an unerased output is achieved.
For each information symbol the number of channel symbols used is a random variable.
Compute the average codeword length for each information bit. What is the rate of this encoding
scheme measured in information bits per channel use? What is the information bit error probability?
1.10 There are two biased coins in a box. The first coin when flipped will produce a “head” with
probability 3 while the second coin will produce a “head” with probability 4. A coin is randomly
selected from the box and flipped.
(a) If a head appears, how much information does this provide about the first coin being selected?
The second coin?
(b) What is the average mutual information provided about the coin selected when the outcome
of a flip of a randomly selected coin is observed?
1.11 There are 13 coins of which 12 are known to have equal weight. The remaining coin is
either the same weight or heavier or lighter than the other coins. The objective is to find the
odd coin, if any, after the coins are mixed and determine whether the odd coin is heavy or light
by using a balance and a known standard coin.
(a) Show by considering the information provided that it is impossible to guarantee solving the
problem in two uses of the balance. Similarly show that it might be possible always to solve the
problem in three weighings.
(b) By trying to maximize the average information provided by the three weighings, give a
weighing strategy that works.
(c) Show that three weighings are not enough without the standard coin.
1.12 For a finite alphabet 𝒰 consider the three distributions P₁(u), P₂(u), and P_λ(u) = λP₁(u) +
(1 − λ)P₂(u) for all u ∈ 𝒰 and λ ∈ (0, 1). Let H₁(𝒰), H₂(𝒰), and H_λ(𝒰) be the corresponding entro-
pies. Using inequality (1.1.8) show that

    λH₁(𝒰) + (1 − λ)H₂(𝒰) ≤ H_λ(𝒰)
1.13 Let y be an absolutely continuous random variable with probability density function p(y), y ∈ 𝒴,
where

    ∫ y p(y) dy = 0   and   ∫ y² p(y) dy = σ²

Using a version of (1.1.8) show that

    ∫ p(y) log [1/p(y)] dy ≤ ½ log (2πeσ²)

with equality when y is a Gaussian random variable. Use this to find the maximum mutual information
I(𝒳; 𝒴) for the additive Gaussian noise channel of Fig. 1.5 where we maximize over the input probabi-
lity density function q(x), x ∈ 𝒳, subject to

    ∫ x q(x) dx = 0   and   ∫ x² q(x) dx = ℰ
1.14 Use the source encoder discussed in Prob. 1.6(b) to show that in the limit of large N the
combination of source and source encoder approximates (in the sense that H_N, defined below,
approaches 1 as N → ∞) a binary-symmetric source. Do this by the following steps.
(a) If source sequence u ∈ 𝒰_N is mapped into codeword x(u) of length l(u), then the encoder
output has normalized information of

    [−log P_N(u)]/l(u)

The average information per binary symbol out of the source encoder is then

    H_N = Σ_u P_N(u) [−log P_N(u)]/l(u)   bits/binary symbol

Show that

    1 + 1/[N log P(u*)] ≤ H_N ≤ 1   where P(u*) = max_u P(u)

(b) Next show that the binary-symmetric source (BSS) is the only binary source that has H_N = 1
for all N.
1.15 Use the source encoder described in the proof of Theorem 1.1.1 (see Fig. 1.3) to show that in the
limit of large N the combination of source and source encoder becomes a binary-symmetric source in
the sense that H_N → 1 as N → ∞.
CHAPTER
TWO
CHANNEL MODELS AND BLOCK CODING
2.1 BLOCK-CODED DIGITAL COMMUNICATION ON THE
ADDITIVE GAUSSIAN NOISE CHANNEL
The most general digital communication system to be treated in this chapter and
the next is that shown in Fig. 2.1. The input digital data is usually binary, but may
have been encoded into any alphabet of q ≥ 2 symbols. The incoming data, which
arrives at the rate of one symbol every T_s seconds, is stored in an input register
until a block of K data symbols¹ has been accumulated. This block is then pre-
sented to the channel encoder as one of M possible messages, denoted H₁, H₂,
..., H_M, where M = q^K and q is the size of the data alphabet. The combination of
encoder and modulator performs a mapping from a set of M messages, {H_m}, onto
a set of M finite-energy signals, {x_m(t)}, of finite duration T = KT_s.
While the encoder-modulator would appear thus to perform a single indivi-
sible function, it can in fact be divided into separate discrete-time and continuous-
time operations. The justification for this separation lies in the Gram-Schmidt
orthogonalization procedure, which permits the representation of any M finite-
energy time functions as linear combinations of N ≤ M orthonormal basis func-
tions. That is, over the finite interval 0 ≤ t ≤ T the M finite-energy signals x₁(t),
x₂(t), ..., x_M(t), representing the M block messages H₁, H₂, ..., H_M respectively,
can be expressed as (see App. 2A)

    x_m(t) = Σ_{n=1}^N x_{mn} φ_n(t)   m = 1, 2, ..., M;  0 ≤ t ≤ T   (2.1.1)
¹ When the data alphabet is binary, these are generally called bits, whether or not they correspond
to bits of information in the sense of Sec. 1.1.
Figure 2.1 Block-coded digital communication system.
Figure 2.1a Detailed encoder-modulator.
Figure 2.1b Detailed demodulator for AWGN channel.
where for each m and n

    x_{mn} = ∫₀ᵀ x_m(t) φ_n(t) dt

and the basis functions {φ₁(t), φ₂(t), ..., φ_N(t)} are orthonormal:

    ∫₀ᵀ φ_k(t) φ_j(t) dt = δ_{kj} = 1 if k = j, 0 if k ≠ j   (2.1.2)

and N ≤ M. In fact, N = M if and only if the signals are linearly independent. A
consequence of this representation is that the signal energies can be expressed as
square norms of the vectors

    x_m = (x_{m1}, x_{m2}, ..., x_{mN})   m = 1, 2, ..., M

for it follows from (2.1.2) that for each m

    ℰ_m = ∫₀ᵀ x_m²(t) dt
        = ∫₀ᵀ [ Σ_{n=1}^N x_{mn} φ_n(t) ]² dt
        = Σ_{n=1}^N Σ_{j=1}^N x_{mn} x_{mj} ∫₀ᵀ φ_n(t) φ_j(t) dt
        = Σ_{n=1}^N x_{mn}²
        = ‖x_m‖²   (2.1.3)
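A numerical illustration of (2.1.2) and (2.1.3) (ours, with an assumed basis): using N disjoint rectangular pulses as the orthonormal functions φ_n(t), the energy computed from the waveform agrees with the squared norm of the coefficient vector. Any orthonormal set would serve; numpy is assumed available.

    import numpy as np

    T, N, dt = 1.0, 4, 1e-4
    t = np.arange(0.0, T, dt)

    # Disjoint rectangular pulses of height sqrt(N/T): an orthonormal set on [0, T).
    phi = np.array([np.where((t >= n * T / N) & (t < (n + 1) * T / N), np.sqrt(N / T), 0.0)
                    for n in range(N)])

    x_m = np.array([1.5, -0.5, 2.0, 0.25])        # coefficients x_{mn}
    x_m_t = x_m @ phi                              # x_m(t) = sum_n x_{mn} phi_n(t), as in (2.1.1)

    print(np.sum(x_m_t ** 2) * dt)                 # integral of x_m(t)^2 dt
    print(np.sum(x_m ** 2))                        # ||x_m||^2: equal up to the Riemann-sum error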
The representation (2.1.1) suggests the general implementation of encoder and
modulator shown in Fig. 2.1a. Thus the encoder becomes a mapping from a
discrete set of M messages to a vector of N ≤ M real numbers. The most general
modulator consists of N amplitude modulators [waveform φ_n(t) modulated by
amplitude x_{mn} for n = 1, 2, ..., N] followed by a summer. In fact, this most
general form is considerably simplified, as will be discussed in Sec. 2.7: when the
amplitudes {x_{mn}} are constrained to be elements of a finite alphabet, strictly
digital encoders can be used, and when the basis functions {φ_n(t)} are chosen to be
disjoint time-orthogonal (i.e., functions which take on nonzero values on disjoint
time intervals), only a single time-shared modulator need be implemented.
The transmitter and receiver of the general system of Fig. 2.1, together with
the propagation medium, may be regarded as a random mapping from the finite
set of transmitted waveforms {x,,(t)} to the received random process y(t). All sorts
of distortions including fading, multipath, intersymbol interference, nonlinear
distortion, and additive noise may be inflicted upon the signal by the propagation
medium and the electromagnetic componentry before it emerges from the
receiver. At this point the only disturbance that we will consider is additive white
Gaussian noise, both to establish a minimally complex model for our starting
point, and also because this model is in fact very accurate for an important class of
communication systems. In Secs. 2.6 and 2.12 we shall consider the influence of
some of the other forms of disturbance just mentioned.
The additive white Gaussian noise (AWGN) channel is modeled simply with a
summing junction, as shown in Fig. 2.1b. For an input x,,(t), the output is
    y(t) = x_m(t) + n(t)   0 ≤ t ≤ T   (2.1.4)
where n(t) is a stationary random process whose power is spread uniformly over a
bandwidth much wider than the signal bandwidth; hence it is modeled as a
process with a uniform arbitrarily wide spectral density, or, equivalently, with
covariance function
    R(τ) = (N₀/2) δ(τ)   (2.1.5)

where δ(·) is the Dirac delta function and N₀ is the one-sided noise power spectral
density.³
The demodulator—decoder can be regarded in general as a mapping from the
received process y(t) to a decision on the state of the original message H , . But, for
this specific channel model, the demodulator—decoder can also be decomposed
into two separate functions which are essentially the duals of those performed by
the encoder—-modulator. Consider first projecting the random process y(t) onto
each of the modulator’s basis functions, thus generating the N integral inner
products
    y_n = ∫₀ᵀ y(t) φ_n(t) dt   n = 1, 2, ..., N   (2.1.6)
This can be performed by the system of Fig. 2.1b. We define also
    n_n = ∫₀ᵀ n(t) φ_n(t) dt   n = 1, 2, ..., N   (2.1.7)

and hence it follows from (2.1.1) and (2.1.4) that

    y_n = x_{mn} + n_n   n = 1, 2, ..., N   (2.1.8)
Now consider the process

    ŷ(t) = y(t) − Σ_{n=1}^N y_n φ_n(t)   (2.1.9)
² Although the propagation medium naturally attenuates the signal, we may ignore this effect by
conceptually amplifying both signal and noise to the normalized pretransmission level.
³ This means that, in response to this noise input, an ideal bandpass filter of bandwidth 1 Hz would
produce an output power of N₀ watts.
Given that x_m(t) is the transmitted signal, it follows from (2.1.9) and (2.1.1) that
this process can be written as

    ŷ(t) = n(t) − Σ_{n=1}^N n_n φ_n(t) ≜ n̂(t)   (2.1.10)

which depends only on the noise process. Thus we may represent the original
process as

    y(t) = Σ_{n=1}^N y_n φ_n(t) + ŷ(t) = Σ_{n=1}^N y_n φ_n(t) + n̂(t)   (2.1.11)
Now, as will be elaborated upon in Sec. 2.2, any statistical decision regarding
the transmitted message is based on the a priori probabilities of the messages and
on the conditional probabilities (densities or distributions) of the measurements
performed on y(t), generally called the observables. Suppose, for the moment, that
we take as the observables only the N projections {y,} defined by (2.1.6). Because
y(t), defined by (2.1.4), is a Gaussian process, the observables are Gaussian vari-
ables with means depending only on the corresponding signal components, since
    E[y_n|x_m] = ∫₀ᵀ x_m(t) φ_n(t) dt + E[n_n] = x_{mn}   n = 1, 2, ..., N   (2.1.12)

and with variances equal to N₀/2, since for any n

    var [y_n|x_m] = E[(y_n − x_{mn})²|x_m] = E[n_n²] = N₀/2   (2.1.13)
Similarly, it follows that these observables are mutually uncorrelated since, for
n ≠ l,

    cov [y_n, y_l|x_m] = E[n_n n_l]
                      = E [ ∫₀ᵀ ∫₀ᵀ n(t) n(u) φ_n(t) φ_l(u) dt du ]
                      = (N₀/2) ∫₀ᵀ φ_n(t) φ_l(t) dt
                      = 0   (2.1.14)

which, since the variables are Gaussian, implies that they are also independent.
Then defining the vector of N observables

    y = (y₁, y₂, ..., y_N)

whose components are independent Gaussian variables with means given by
(2.1.12) and variances N₀/2, it follows that the conditional probability density of y
given the signal vector x_m (or equivalently, given that message H_m was sent) is

    p_N(y|x_m) = ∏_{n=1}^N p(y_n|x_{mn}) = ∏_{n=1}^N e^{−(y_n − x_{mn})²/N₀} / √(πN₀)   (2.1.15)
Returning to the representation (2.1.11) of y(t), while it is clear that the vector
of observables y = (y₁, y₂, ..., y_N) completely characterizes the terms of the
summation, there remains the term n̂(t), defined by (2.1.10), which depends only
on the noise and not at all on the signals. Furthermore, since the noise has zero
mean, n̂(t) is a zero-mean Gaussian process. Finally, n̂(t), and hence any observ-
able derived therefrom, is independent of all the observables {y_n} because

    E[n̂(t) y_n|x_m] = E[n̂(t) n_n]
                    = E{ [n(t) − Σ_{j=1}^N n_j φ_j(t)] n_n }
                    = E[ n(t) ∫₀ᵀ n(u) φ_n(u) du ] − Σ_{j=1}^N E[n_j n_n] φ_j(t)
                    = (N₀/2) φ_n(t) − (N₀/2) φ_n(t)
                    = 0

Thus, since any observable based on n̂(t) is independent of the observables {y_n}
and of the transmitted signal x_m, it should be clear that such an observable is
irrelevant to the decision of which message was transmitted. More explicitly, if n̂ is
Figure 2.2 General memoryless channel: H_m → Encoder → x_m → Channel p_N(y|x_m) → y → Decoder → Ĥ_m.
any vector of N′ observables based only on n̂(t), then it follows from the above that
the joint conditional probability density is

    p_{N+N′}(y, n̂|x_m) = p_N(y|x_m) p_{N′}(n̂)

Since the term p_{N′}(n̂) enters into all the conditional densities (for m = 1, 2, ..., M)
in identically the same way, it is useless in making the decision.
Hence, we conclude finally that the components of the original observable
vector y are the only data based on y(t) useful for the decision and thus represent
sufficient statistics. Therefore the demodulator can be implemented as shown in
Fig. 2.1b. The time-continuous process is thus reduced by the demodulator to the
N-dimensional random vector y which then constitutes the input to the decoder
whose structure we shall study in the next section. We may summarize the results
of this section by noting that, for the AWGN channel, by using the general but
explicit forms of modulators and demodulators of Figs. 2.1a and 2.1b, we can
reduce the overall system to the
model of Fig. 2.2 where the channel is in effect a random mapping defined by the
conditional probability density
    p_N(y|x_m) = ∏_{n=1}^N p(y_n|x_{mn})   (2.1.16)
While this result has only been shown to characterize an AWGN channel, many
other channels can be characterized in this way. Any channel whose conditional
(or transition) probability density (or distribution) satisfies (2.1.16) is called a
memoryless channel. We shall discuss a class of memoryless channels derived from
the AWGN channel in Sec. 2.8, and give more elaborate examples in Sec. 2.12.
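As a hedged sketch (ours, not from the text) of how the memoryless density (2.1.15)-(2.1.16) is evaluated for the AWGN channel; the function name is our own and numpy is assumed available.

    import numpy as np

    def awgn_likelihood(y, x_m, N0):
        # p_N(y | x_m) = prod_n exp[-(y_n - x_mn)^2 / N0] / sqrt(pi N0), as in (2.1.15).
        y, x_m = np.asarray(y, float), np.asarray(x_m, float)
        return float(np.prod(np.exp(-(y - x_m) ** 2 / N0) / np.sqrt(np.pi * N0)))

    N0 = 2.0
    x_m = np.array([1.0, -1.0, 1.0])
    rng = np.random.default_rng(1)
    y = x_m + rng.normal(0.0, np.sqrt(N0 / 2), size=3)   # observables: signal plus noise of variance N0/2
    print(awgn_likelihood(y, x_m, N0))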
2.2 MINIMUM ERROR PROBABILITY AND MAXIMUM
LIKELIHOOD DECODER
There remain the problems of characterizing more explicitly the encoder and
decoder. Both will occupy the better part of this book. The principles and optimal
design of the decoder are more easily developed, although its implementation is
usually more complex than that of the encoder. The goal of the decoder is to
perform a mapping from the vector y to a decision Ĥ on the message transmitted.
Such a decision must be based on some desirable criterion of performance. The
most reasonable, as well as the most convenient, criterion for this decision is to
minimize the probability of error of the decision. Suppose that, when the vector y
takes on some particular value (a real vector), we make the decision Ĥ = H_m.
The probability of an error in this decision, which we denote by P_E(H_m; y), is just

    P_E(H_m; y) = Pr (H_m not sent | y) = 1 − Pr (H_m sent | y)   (2.2.1)
Now, since our criterion is to minimize the error probability in mapping each
given y into a decision, it follows that the optimum decision rule is
    Ĥ = H_m   if Pr (H_m sent | y) ≥ Pr (H_m′ sent | y) for all m′ ≠ m   (2.2.2)
If m satisfies inequality (2.2.2) but equality holds for one or more values of m’, we
may choose any of these m’ as the decision and achieve the same error probability.
Condition (2.2.1), which is completely general for any channel (memoryless or
not), can be expressed more explicitly in terms of the prior probabilities of the
messages

    π_m = Pr (H_m sent)   m = 1, 2, ..., M   (2.2.3)

and in terms of the conditional probabilities of y given each H_m (usually called the
likelihood functions⁴)

    p_N(y|H_m) = p_N(y|x_m)   m = 1, 2, ..., M   (2.2.4)
This last relation follows from the fact that the mapping from H,, to x,,, which is
the coding operation, is deterministic and one-to-one. These likelihood functions,
which are in fact the channel characterization (Fig. 2.2), are also called the channel
transition probabilities. Applying Bayes’ rule to (2.2.2), using (2.2.3) and (2.2.4),
and for the moment ignoring ties, we conclude that, to minimize error probability,
the optimum decision is

    Ĥ = H_m   if   π_m p_N(y|x_m)/p_N(y) ≥ π_m′ p_N(y|x_m′)/p_N(y)   for all m′ ≠ m   (2.2.5)

Since the denominator p_N(y), the unconditional probability (density) of y, is
independent of m, it can be ignored. Also, since it is usually more convenient to
perform summations than multiplications, and since if A ≥ B > 0 then ln A ≥
ln B, we rewrite (2.2.5) as

    Ĥ = H_m   if   ln π_m + ln p_N(y|x_m) ≥ ln π_m′ + ln p_N(y|x_m′)   for all m′ ≠ m   (2.2.6)
For a memoryless channel as defined by (2.1.16), this decision simplifies further to

    Ĥ = H_m   if   ln π_m + Σ_{n=1}^N ln p(y_n|x_{mn}) ≥ ln π_m′ + Σ_{n=1}^N ln p(y_n|x_{m′n})
                   for all m′ ≠ m   (2.2.7)
⁴ p_N(·) is a density function if y is a vector of continuous random variables, and is a distribution if y
is a vector of discrete random variables.
Another useful interpretation of the above, consistent with our original view
of the decoder as a mapping, is that the decision rule (2.2.6) or (2.2.7) defines a
partition of the N-dimensional space of all observable vectors y into regions Λ₁,
Λ₂, ..., Λ_M where

    Λ_m = {y: ln π_m + ln p_N(y|x_m) > ln π_m′ + ln p_N(y|x_m′) for all m′ ≠ m}   (2.2.8)

As is clear from their definition, these regions must be disjoint; i.e.,

    Λ_k ∩ Λ_j = ∅   for all k ≠ j   (2.2.9)

Then the decision rule can indeed be considered as the mapping from y to H_m such
that

    if y ∈ Λ_m   then Ĥ(y) = H_m   (2.2.10)
Aside from the boundaries between regions, it is also clear from definition (2.2.8)
that the regions Λ_m cover the entire space of observable vectors y. We shall adopt
the convention that all ties will be resolved at random. That is, the boundary
region between Λ_m and Λ_m′, consisting of all y for which (2.2.8) becomes an
equality, will be resolved a priori by the flip of a fair coin; the outcome of such a
flip does not alter the ultimate error probability since, for y on the boundary,
(2.2.2) is satisfied with equality. It then follows that the union of the regions covers
the entire N-dimensional observation space 𝒴_N; that is

    ∪_{m=1}^M Λ_m = 𝒴_N   (2.2.11)
The above concept can best be demonstrated by examining again the AWGN
channel defined by (2.1.15). Since the channel is memoryless, we have, using
(2.1.16) and (2.1.3) and the boundary convention⁵

    Λ_m = { y: ln (π_m/π_m′) + Σ_{n=1}^N ln [p(y_n|x_{mn})/p(y_n|x_{m′n})] ≥ 0   for all m′ ≠ m }
        = { y: ln (π_m/π_m′) − (1/N₀)‖y − x_m‖² + (1/N₀)‖y − x_m′‖² ≥ 0   for all m′ ≠ m }
        = { y: ln (π_m/π_m′) + (2/N₀) Σ_{n=1}^N (x_{mn} − x_{m′n}) y_n
               − (1/N₀) [ Σ_{n=1}^N x_{mn}² − Σ_{n=1}^N x_{m′n}² ] ≥ 0   for all m′ ≠ m }
        = { y: (2/N₀)⟨x_m − x_m′, y⟩ − (ℰ_m − ℰ_m′)/N₀ + ln (π_m/π_m′) ≥ 0   for all m′ ≠ m }
                                                                              (2.2.12)
⁵ We denote the inner product Σ_{n=1}^N a_n b_n of vectors a = (a₁, a₂, ..., a_N) and b = (b₁, b₂, ..., b_N)
by ⟨a, b⟩.
Figure 2.3 Signal sets and decision regions: (a) M = 4 signals in N = 2 dimensions with unequal energies and prior probabilities; (b) equal energies and equal prior probabilities (ℰ_m = ℰ, π_m = 1/4, m = 1, 2, 3, 4).
Note also that, by virtue of (2.1.1) and (2.1.6),

    ⟨x_m − x_m′, y⟩ = ∫₀ᵀ [x_m(t) − x_m′(t)] y(t) dt

while

    ℰ_m = ∫₀ᵀ x_m²(t) dt
Thus it follows from (2.2.12) that for the AWGN channel the decision regions are
regions of the N-dimensional real vector space, bounded by linear
[(N — 1)-dimensional hyperplane] boundaries. Figure 2.3a and b gives two
examples of decision regions for M = 4 signals in N = 2 dimensions, the first with
unequal energies and prior probabilities, and the second with equal energies and
prior probabilities, i.e., with &,, = & and z,, = 1/M for all m. Decision regions for
more elaborate signal sets are treated in Probs. 2.1 and 2.2. We note also from this
result and (2.2.12) that the decision rule, and hence the decoder, for the AWGN
can be implemented as shown in Fig. 2.4, where the M multipliers each multiply
the N observables by N signal component values and the products are succes-
sively added to form the inner products. When the prior probabilities and energies
are all equal, the additional summing junctions can be eliminated. Examples of
decoders for other channels will be given in Secs. 2.8 and 2.12.
In most cases of interest, the message a priori probabilities are all equal; that
is,
    π_m = 1/M   m = 1, 2, ..., M   (2.2.13)
Figure 2.4 An implementation of decoder for AWGN channel: M accumulators forming ⟨x_m, y⟩, bias terms (N₀ ln π_m − ℰ_m)/2, and a maximum detector producing Ĥ_m.
As was discussed in Chap. 1, this is in fact the situation when the original data
source has been efficiently encoded into equiprobable sequences of data symbols.
In this case, the factors π_m and π_m′ can be eliminated in (2.2.5) through (2.2.8) and
(2.2.12). The decision rule and corresponding decoder are then referred to as
maximum likelihood. The maximum likelihood decoder depends only on the chan-
nel, and is often robust in the sense that it gives the same or nearly the same error
probability for each message regardless of the true message a priori probabilities.
From a practical design point of view, this is important because different users
may have different message a priori probabilities. Henceforth, in the text we shall
assume only equiprobable messages® and thus the maximum likelihood decoder
will be optimum. Unequal prior probability cases will be treated in the problems.
For a memoryless channel, the logarithm of the likelihood function (2.2.4) is
commonly called the metric; thus a maximum likelihood decoder computes the
metrics for each possible signal vector, compares them, and decides in favor of the
maximum.
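A sketch of such a decoder (ours, following the structure of Fig. 2.4 under the AWGN model): compute ⟨x_m, y⟩ plus the bias (N₀ ln π_m − ℰ_m)/2 for each message and pick the maximum. With equal priors the bias reduces to −ℰ_m/2 and the rule is maximum likelihood. numpy is assumed available.

    import numpy as np

    def awgn_decode(y, signals, N0, priors=None):
        # Pick the message maximizing <x_m, y> + (N0 ln pi_m - E_m)/2, which is
        # equivalent to the rule (2.2.12); with equal priors this is maximum likelihood.
        signals = np.asarray(signals, float)
        M = signals.shape[0]
        priors = np.full(M, 1.0 / M) if priors is None else np.asarray(priors, float)
        energies = np.sum(signals ** 2, axis=1)
        metrics = signals @ y + 0.5 * (N0 * np.log(priors) - energies)
        return int(np.argmax(metrics))

    N0 = 1.0
    signals = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
    rng = np.random.default_rng(2)
    sent = 2
    y = signals[sent] + rng.normal(0.0, np.sqrt(N0 / 2), size=2)
    print(awgn_decode(y, signals, N0), "(sent message index:", sent, ")")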
2.3 ERROR PROBABILITY AND A SIMPLE UPPER BOUND
Having established the optimum decoder to minimize error probability for any
given set of observables, we now wish to determine its performance as a function
of the signal set. Given that message H,,, (signal vector x,,) was sent and a given
⁶ Note that, for the AWGN channel, unequal prior probabilities require only inclusion of the
additive term in (2.2.12).
observation vector y was received, an error will occur if y is not in Λ_m (denoted
y ∉ Λ_m or y ∈ Λ̄_m). Since y is a random vector, the probability of error when x_m is
sent is then
P_{E_m} = Pr {y ∈ Λ̄_m | x_m}
       = 1 − Pr {y ∈ Λ_m | x_m}
       = 1 − Σ_{y∈Λ_m} p_N(y | x_m)    (2.3.1)
We use the symbol Σ to denote summation or integration over a subspace of the
observation space. Thus, for continuous channels (such as the AWGN channel)
with N-dimensional observation vectors, Σ represents an N-dimensional integra-
tion and p_N(·) is a density function. On the other hand, for discrete channels where
both the x_m and y vector components are elements of a finite symbol alphabet, Σ
represents an N-fold summation and p_N(·) represents a discrete distribution.
The overall error probability is then the average of the message error
probabilities
P_E = Σ_{m=1}^M π_m P_{E_m}
    = (1/M) Σ_{m=1}^M P_{E_m}    (2.3.2)
Although the calculation of P, by (2.3.2) is conceptually straightforward, it is
computationally impractical in all but a few special cases (see, e.g., Probs. 2.4 and
2.5). On the other hand, simple upper bounds on P_E are available which in some
cases give very tight approximations. When these fail, a more elaborate upper
bound, derived in the next section, gives tight results for virtually all cases of
practical interest.
A simple upper bound on P_E is obtained by examining the complements Λ̄_m of
the decision regions. By definition (2.2.8) with π_m = 1/M for all m, Λ̄_m can be
written as⁷
Λ̄_m = {y: ln p_N(y|x_m′) ≥ ln p_N(y|x_m) for some m′ ≠ m}
    = ∪_{m′≠m} {y: ln p_N(y|x_m′) ≥ ln p_N(y|x_m)}
    ≜ ∪_{m′≠m} Λ_{mm′}    (2.3.3)
where
Λ_{mm′} ≜ {y: ln p_N(y|x_m′) ≥ ln p_N(y|x_m)}
’ We take for the moment the pessimistic view that all ties are resolved in favor of the other
message, thus at worst increasing the error probability. We note, however, that for continuous chan-
nels such as the AWGN, the boundaries do not contribute measurably to the error probability.
Figure 2.5 Λ_{mm′} regions for signal set of Fig. 2.3b. Λ̄_4 = Λ_41 ∪ Λ_42 ∪ Λ_43.
Note that each of the terms Λ_{mm′} of the union is actually the decision region for x_m′
when there are only the two signals (messages) x_m and x_m′. An example based on
the signal set of Fig. 2.3b is shown in Fig. 2.5. Using (2.3.3) in (2.3.1), we find from
the axioms of probability that
P_{E_m} = Pr {y ∈ Λ̄_m | x_m}
       = Pr {y ∈ ∪_{m′≠m} Λ_{mm′} | x_m}
       ≤ Σ_{m′≠m} Pr {y ∈ Λ_{mm′} | x_m}
       = Σ_{m′≠m} Pr {ln [p_N(y|x_m′)/p_N(y|x_m)] ≥ 0 | x_m}
       ≜ Σ_{m′≠m} P_E(m → m′)    (2.3.4)
where P_E(m → m′) denotes the pairwise error probability when x_m is sent and x_m′ is
the only alternative. We note that the inequality (2.3.4) becomes an equality⁸
⁸ Also, for some trivial channels, P_E(m → m′) ≠ 0 for at most one m′ ≠ m, thus obviously satisfying
(2.3.4) as an equality.
whenever the regions Λ_{mm′} are disjoint, which occurs only in the trivial case where
M = 2. For obvious reasons the bound of (2.3.4) is called a union bound.
For the AWGN channel, the terms of the union bound can be calculated
exactly, by using (2.2.12) with π_m = π_m′. This gives
P_E(m → m′) = Pr {Z_{mm′} ≥ (ε_m′ − ε_m)/N_0}    (2.3.5)
where
Z_{mm′} = (2/N_0) Σ_{n=1}^N (x_m′n − x_mn) y_n
        = (2/N_0)(x_m′ − x_m, y)
But, since x_m was sent,
y_n = x_mn + n_n    (2.3.6)
for each n is a Gaussian random variable with mean x_mn and variance N_0/2. Also,
as was shown in (2.1.14), y_n and y_l are independent for all n ≠ l. Hence, since Z_{mm′}
is a linear combination of independent Gaussian variables, it must be itself Gaus-
sian; using (2.1.3) and (2.3.6), we find its mean
E(Z_{mm′} | x_m) = (2/N_0) Σ_{n=1}^N (x_m′n − x_mn) x_mn
               = (2/N_0)[(x_m, x_m′) − ε_m]
               ≜ μ_z    (2.3.7)
and its variance
var (Z_{mm′} | x_m) = (4/N_0²) Σ_{n=1}^N (x_m′n − x_mn)² var (y_n | x_mn)
                   = (2/N_0)‖x_m − x_m′‖² ≜ σ_z²    (2.3.8)
Thus
P_E(m → m′) = ∫_{(ε_m′−ε_m)/N_0}^∞ [e^{−(Z_{mm′} − μ_z)²/(2σ_z²)} / (√(2π) σ_z)] dZ_{mm′}
            = Q(β)
where
β = [(ε_m′ − ε_m)/N_0 − μ_z]/σ_z
  = [ε_m + ε_m′ − 2(x_m, x_m′)]/(N_0 σ_z)
  = ‖x_m − x_m′‖/√(2N_0)    (2.3.9)
This leads finally to the simple expression
P_E(m → m′) = Q(‖x_m − x_m′‖/√(2N_0))    (2.3.10)
where Q(·) is the Gaussian integral function
Q(β) ≜ ∫_β^∞ e^{−x²/2} dx/√(2π)    (2.3.11)
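A short numerical sketch of the pairwise error probability (2.3.10), using the standard identity Q(x) = ½ erfc(x/√2); the function names and the antipodal example are illustrative assumptions.

```python
import math

def Q(beta):
    """Gaussian integral function (2.3.11): P(X > beta) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(beta / math.sqrt(2.0))

def pairwise_error(x_m, x_mp, N0):
    """Two-signal error probability (2.3.10): Q(||x_m - x_m'|| / sqrt(2 N0))."""
    d = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_m, x_mp)))
    return Q(d / math.sqrt(2.0 * N0))

# Antipodal pair +-sqrt(E) in one dimension at E/N0 = 4 (about 6 dB):
E, N0 = 4.0, 1.0
print(pairwise_error([math.sqrt(E)], [-math.sqrt(E)], N0))   # Q(sqrt(2E/N0)) ~ 2.3e-3
```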
Returning to the error probability bound (2.3.4), we now derive a weaker but
completely general bound on P_E(m → m′). It follows immediately from (2.3.4) and
(2.3.3) that
P_E(m → m′) = Σ_{y∈Λ_{mm′}} p_N(y | x_m)
where
Λ_{mm′} = {y: p_N(y|x_m′)/p_N(y|x_m) ≥ 1}    (2.3.12)
and 𝒴_N is the entire observation space. We may express this alternatively as
P_E(m → m′) = Σ_{y∈𝒴_N} f(y) p_N(y | x_m)    (2.3.13)
where
f(y) = 1    for all y ∈ Λ_{mm′}
f(y) = 0    for all y ∉ Λ_{mm′}
But we may easily bound f(y) by
f(y) ≤ √[p_N(y|x_m′)/p_N(y|x_m)]    for all y ∈ Λ_{mm′}
f(y) ≤ √[p_N(y|x_m′)/p_N(y|x_m)]    for all y ∉ Λ_{mm′}    (2.3.14)
where the upper branch bound follows from (2.3.12), while the lower branch
bound follows trivially. Then since the factors in the summands of (2.3.13) are
everywhere nonnegative, we may replace f(y) by its bound (2.3.14) and obtain
P_E(m → m′) ≤ Σ_{y∈𝒴_N} √[p_N(y|x_m′) p_N(y|x_m)]    (2.3.15)
The expression (2.3.15) is called the Bhattacharyya bound, and its negative logar-
ithm the Bhattacharyya distance. It is a special case of the Chernoff bound which
will be derived in the next chapter (see also Prob. 2.10).
Combining the union bound (2.3.4) with the general Bhattacharyya bound
(2.3.15), we obtain finally a bound on the error probability for the mth message
P_{E_m} ≤ Σ_{m′≠m} P_E(m → m′)
       ≤ Σ_y Σ_{m′≠m} √[p_N(y|x_m′) p_N(y|x_m)]    (2.3.16)
The interchange of summations is always valid because at least the sum over m’ is
over a finite set. Equation (2.3.16) will be shown to be a special case of the more
elaborate bound derived in the next section.
To assess the tightness of the Bhattacharyya bound and to gain some intui-
tion, we again consider the AWGN channel and substitute the likelihood func-
tions of (2.1.15) into (2.3.15). Then, since 𝒴_N is a space of real vectors, we obtain
P_E(m → m′) ≤ (πN_0)^{−N/2} ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ exp {−(1/2N_0)[‖y − x_m‖² + ‖y − x_m′‖²]} dy
            = exp {−‖x_m − x_m′‖²/(4N_0)}    (2.3.17)
Comparing the bound (2.3.17) with the exact expression (2.3.10), we find that we
have replaced Q(β) by exp (−β²/2). But it is well known (see Wozencraft and
Jacobs [1965]) that
[1/(√(2π) β)](1 − 1/β²) e^{−β²/2} < Q(β) < [1/(√(2π) β)] e^{−β²/2}
Thus, for large arguments, the bound (2.3.17) is reasonably tight. Note also that
the negative logarithm of (2.3.17) is proportional to the square of the distance
between signals. To carry this one step further and evaluate the tightness of the
union bound, we consider the special case of M equal-energy M-dimensional
signals, each with a unique nonzero component
x_mn = 0     if n ≠ m
x_mn = √ε    if n = m
that is, x_mn = δ_mn √ε.
(This is a special case of an orthogonal signal set and will be considered further in
Sec. 2.5.) Then (2.3.17) becomes
P_E(m → m′) ≤ e^{−ε/(2N_0)}    for all m′ ≠ m
and consequently (2.3.16) yields the union bound
P_{E_m} ≤ (M − 1) e^{−ε/(2N_0)}    m = 1, 2, ..., M    (2.3.19)
Thus this bound is useless when M > exp (ε/2N_0). In the next section, we derive a
bound which is useful over a considerably extended range.
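The following small Python sketch evaluates the union-Bhattacharyya bound (2.3.19) for orthogonal signals and shows how it becomes useless as M grows; the function name and parameter values are illustrative.

```python
import math

def union_bhattacharyya_orthogonal(M, E_over_N0):
    """Union-Bhattacharyya bound (2.3.19) for M equal-energy orthogonal
    signals on the AWGN channel: (M - 1) exp(-E / 2 N0)."""
    return (M - 1) * math.exp(-E_over_N0 / 2.0)

E_over_N0 = 10.0
for M in (2, 16, 256, 1 << 16):
    print(M, union_bhattacharyya_orthogonal(M, E_over_N0))
# The bound exceeds 1, and hence is useless, once M > exp(E / 2 N0) ~ 148 here.
```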
2.4 A TIGHTER UPPER BOUND ON ERROR PROBABILITY
When the union bound fails to give useful results, a more refined technique will
invariably yield an improved bound which is tight over a significantly wider range.
Returning to the original general expression (2.3.1), we begin by defining the
subset of the observation space
Λ̄′_m ≜ {y: Σ_{m′≠m} [p_N(y|x_m′)/p_N(y|x_m)]^λ ≥ 1}    λ > 0    (2.4.1)
which contains the region of summation Λ̄_m. For if π_m = 1/M for all m, then, by
the definition (2.2.8) we have for any y ∈ Λ̄_m
p_N(y|x_m″)/p_N(y|x_m) ≥ 1    for some m″ ≠ m    (2.4.2)
Moreover, since λ > 0, raising both sides of the inequality (2.4.2) to the λth power
does not alter the inequality, and summing over all m’ # m will include the m”
term for which (2.4.2) holds, in addition to other nonnegative terms. Hence (2.4.2)
implies
Σ_{m′≠m} [p_N(y|x_m′)/p_N(y|x_m)]^λ ≥ 1    for all y ∈ Λ̄_m    (2.4.3)
It then follows from (2.4.1) and (2.4.3) that every y ∈ Λ̄_m is also in Λ̄′_m, and
consequently that
Λ̄_m ⊂ Λ̄′_m    (2.4.4)
Thus, since the summand in (2.3.1) is always nonnegative, by enlarging the do-
main of summation of (2.3.1) we obtain the bound
P_{E_m} ≤ Σ_{y∈Λ̄′_m} p_N(y|x_m) = Σ_y f(y) p_N(y|x_m)
where
f(y) = 1    if y ∈ Λ̄′_m
f(y) = 0    if y ∉ Λ̄′_m    (2.4.5)
Furthermore, we have
f(y) ≤ [Σ_{m′≠m} (p_N(y|x_m′)/p_N(y|x_m))^λ]^ρ    for all y ∈ 𝒴_N, ρ > 0, λ > 0    (2.4.6)
for it follows from the definition (2.4.1) that, if y ∈ Λ̄′_m, the right side of (2.4.6) is
at least 1, while, if y ∉ Λ̄′_m, the right side is still nonnegative.
Substituting the bound (2.4.6) for f(y) in (2.4.5) yields
P_{E_m} ≤ Σ_y p_N(y|x_m) [Σ_{m′≠m} (p_N(y|x_m′)/p_N(y|x_m))^λ]^ρ    λ > 0, ρ > 0    (2.4.7)
Since λ and ρ are arbitrary positive numbers, we may choose λ = 1/(1 + ρ) and
thus obtain
P_{E_m} ≤ Σ_y [p_N(y|x_m)]^{1/(1+ρ)} [Σ_{m′≠m} [p_N(y|x_m′)]^{1/(1+ρ)}]^ρ    ρ > 0    (2.4.8)
This bound, which is due to R. G. Gallager [1965], is much less intuitive than the
union bound. However, it is clear that the union bound (2.3.16) is the special case
of this bound obtained by setting p = 1 in (2.4.8). To what extent the Gallager
bound is more powerful than the union bound will be demonstrated by the
example of the next section.
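As a computational aside, the Gallager bound (2.4.8) can be evaluated by brute force for a very small discrete channel; the sketch below does so by enumerating the entire output space. The channel, code, and function names are illustrative assumptions, and ρ = 1 reproduces the union-Bhattacharyya bound.

```python
import itertools

def gallager_bound(codewords, trans, rho):
    """Gallager bound (2.4.8) on P_E_m for each codeword over a DMC.
    `codewords` is a list of input-symbol tuples; `trans[x][y]` = p(y|x).
    Enumerates the whole output space, so it suits only tiny examples."""
    J = len(next(iter(trans.values())))       # output alphabet size
    N = len(codewords[0])
    lam = 1.0 / (1.0 + rho)
    bounds = []
    for m, cm in enumerate(codewords):
        total = 0.0
        for y in itertools.product(range(J), repeat=N):
            p_m = 1.0
            for x, yn in zip(cm, y):
                p_m *= trans[x][yn]
            inner = 0.0
            for mp, other in enumerate(codewords):
                if mp == m:
                    continue
                p_mp = 1.0
                for x, yn in zip(other, y):
                    p_mp *= trans[x][yn]
                inner += p_mp ** lam
            total += (p_m ** lam) * (inner ** rho)
        bounds.append(total)
    return bounds

# BSC with crossover 0.05 and two length-3 codewords; rho = 1 gives the
# union-Bhattacharyya bound, while smaller rho may or may not be tighter.
bsc = {0: [0.95, 0.05], 1: [0.05, 0.95]}
code = [(0, 0, 0), (1, 1, 1)]
for rho in (1.0, 0.5):
    print(rho, gallager_bound(code, bsc, rho))
```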
2.5 EQUAL-ENERGY ORTHOGONAL SIGNALS ON THE
AWGN CHANNEL
To test the results of the preceding section on a specific signal set and channel, we
consider the most simply described and represented signal set on the AWGN
channel. This is the set of equal-energy orthogonal signals defined by the relations
∫_0^T x_m(t) x_n(t) dt = ε δ_mn = { ε  if m = n;  0  if m ≠ n }    m, n = 1, 2, ..., M    (2.5.1)
In the next section, we shall consider several examples of orthogonal signal sets.
Since the signals are already orthogonal, the orthonormal basis functions are most
conveniently chosen as
φ_n(t) = x_n(t)/√ε    n = 1, 2, ..., M    (2.5.2)
which clearly satisfies (2.1.2). Then the signal vector components become simply
x_mn = √ε δ_mn    m, n = 1, 2, ..., M    (2.5.3)
and consequently the likelihood function for the AWGN channel given in (2.1.15)
becomes, with N = M,
p_M(y | x_m) = [exp {−(y_m − √ε)²/N_0} ∏_{n≠m} exp (−y_n²/N_0)] / (πN_0)^{M/2}    (2.5.4)
Substituting into (2.4.8), we obtain after a few manipulations, for every m
P_{E_m} ≤ e^{−ε/N_0} ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ ∏_{n=1}^M [e^{−y_n²/N_0}/√(πN_0)] exp [2√ε y_m/(N_0(1+ρ))] {Σ_{m′≠m} exp [2√ε y_m′/(N_0(1+ρ))]}^ρ dy
Letting z_n = y_n/√(N_0/2), this becomes
P_{E_m} ≤ e^{−ε/N_0} ∫_{−∞}^∞ ⋯ ∫_{−∞}^∞ ∏_{n=1}^M [e^{−z_n²/2}/√(2π)] g(z_m) [Σ_{m′≠m} g(z_m′)]^ρ dz    ρ ≥ 0    (2.5.5)
where
g(z) ≜ exp [√(2ε/N_0) z/(1 + ρ)]    (2.5.6)
Since the M-fold product in (2.5.5) is the density function of M independent
normalized (zero mean, unit variance) Gaussian variables, (2.5.5) can be expressed
as
P_{E_m} ≤ e^{−ε/N_0} E[g(z_m)] E{[Σ_{m′≠m} g(z_m′)]^ρ}    (2.5.7)
where the expectation is with respect to the independent normalized Gaussian
variables z_1, z_2, ..., z_M. Then the expectation of (2.5.6) is readily determined to be
E[g(z)] = ∫_{−∞}^∞ (e^{−z²/2}/√(2π)) exp [√(2ε/N_0) z/(1 + ρ)] dz
        = exp [ε/(N_0(1 + ρ)²)]    (2.5.8)
The second expectation in (2.5.7) cannot be evaluated in closed form. But it
can be upper bounded simply, provided we restrict the parameter p to lie in the
unit interval. For, by the Jensen inequality derived in App. 1B, we have for a
convex ∩ function f(·) of a random variable ξ
E[f(ξ)] ≤ f(E[ξ])    (2.5.9)
Now letting
ξ = Σ_{m′≠m} g(z_m′)    and    f(ξ) = ξ^ρ
which is a convex ∩ function provided 0 < ρ ≤ 1, we obtain from (2.5.9)
E{[Σ_{m′≠m} g(z_m′)]^ρ} ≤ {E[Σ_{m′≠m} g(z_m′)]}^ρ
                        = [(M − 1) E[g(z)]]^ρ    0 < ρ ≤ 1    (2.5.10)
where the equality follows because all the random variables z,,, are identically
distributed. Thus (2.5.7) becomes
P_{E_m} ≤ (M − 1)^ρ e^{−ε/N_0} (E[g(z)])^{1+ρ}    (2.5.11)
This bound holds uniformly for all m and hence is also a bound on P_E. Finally,
substituting (2.5.8) into (2.5.11), we obtain
P_E ≤ (M − 1)^ρ exp {−(ε/N_0)[ρ/(1 + ρ)]}    0 ≤ ρ ≤ 1    (2.5.12)
Clearly, (2.5.12) is a generalization of the union-Bhattacharyya bound (2.3.19), to
which it reduces when p = 1.
Before proceeding to optimize this bound with respect to p, it is convenient to
define the signal-to-noise parameter
C_T ≜ ε/(N_0 T) = S/N_0    (2.5.13)
where S is the signal power, or energy per second, and to define the rate⁹
parameter
R_T = (ln M)/T = (ln q)/T_s    nats/s    (2.5.14)
as is appropriate since we assumed that the source emits one of q equally likely
symbols once every T_s seconds. Then, trivially bounding (M − 1) by M, we can
express (2.5.12) in terms of (2.5.13) and (2.5.14) as
P_E < exp {−T[E_0(ρ) − ρR_T]}
where
E_0(ρ) = ρC_T/(1 + ρ)    0 ≤ ρ ≤ 1    (2.5.15)
The tightest upper bound of this form is obtained by maximizing the negative
exponent of (2.5.15) with respect to ρ on the unit interval. But, for positive ρ, this
negative exponent is a convex ∩ function, as shown in Fig. 2.6, with maximum at
ρ = √(C_T/R_T) − 1. Thus for ¼ ≤ R_T/C_T ≤ 1, this maximum occurs within the unit
interval; but, for R_T/C_T < ¼, the maximum occurs at ρ > 1 and consequently the
negative exponent increases monotonically on the unit interval; hence, in the
latter case, the tightest bound results when ρ = 1. Substituting these values of ρ
into (2.5.15), we obtain
P_E < e^{−T E(R_T)}    (2.5.16)
® This is a scaling of the binary data rate for which the logarithm is usually taken to the base 2 and
the dimensions are in bits per second.
Figure 2.6 Negative exponent E_0(ρ) − ρR_T of upper bound (2.5.15): (a) ¼ ≤ R_T/C_T ≤ 1; (b) R_T/C_T < ¼.
where
E(R_T) = ½C_T − R_T    0 ≤ R_T/C_T ≤ ¼
E(R_T) = (√C_T − √R_T)²    ¼ ≤ R_T/C_T ≤ 1
For R_T > C_T, the bound is useless and in fact, as will be discussed in the next
chapter, in this region P_E → 1 as T and M approach infinity.
The bound (2.5.16) was first obtained in a somewhat more elaborate form by
R. M. Fano [1961]. The negative exponent E(R_T), sometimes called the reliability
function, is shown in Fig. 2.7. Note that the union-Bhattacharyya bound (2.3.19),
corresponding to (2.5.12) with p = 1, would produce the straight-line exponent
shown dashed in the figure. Thus the Gallager bound dominates the union bound
everywhere but at low rates, a property we shall find true for much more general
channels and signal sets.
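A short Python sketch of the two exponents compared in Fig. 2.7, normalized by C_T; the function names are illustrative, and the formulas are those of (2.5.16) and the ρ = 1 case of (2.5.12).

```python
import math

def reliability_exponent(r):
    """E(R_T)/C_T of (2.5.16) as a function of r = R_T/C_T, 0 <= r <= 1."""
    if r <= 0.25:
        return 0.5 - r                      # rho = 1 region
    return (1.0 - math.sqrt(r)) ** 2        # rho = sqrt(1/r) - 1 region

def union_bhattacharyya_exponent(r):
    """Straight-line exponent obtained by fixing rho = 1, i.e. (2.3.19)."""
    return 0.5 - r

for r in (0.1, 0.25, 0.5, 0.75, 1.0):
    print(f"r={r:4.2f}  E/C={reliability_exponent(r):6.4f}  "
          f"union-Bhatt={union_bhattacharyya_exponent(r):7.4f}")
# The two agree for r <= 1/4; above that rate the optimized exponent is larger.
```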
Figure 2.7 Negative exponent E(R_T)/C_T of optimized upper bound (2.5.16) versus R_T/C_T; the union-Bhattacharyya bound exponent is shown as a dashed straight line.
Another choice of parameters, more physically oriented than those in (2.5.13)
and (2.5.14), involves the received energy per information bit. This is defined in
terms of the system of Fig. 2.1 where g = 2, as the energy per signal normalized by
the number of bits transmitted per signal, that is,
ε_b ≜ ε/log_2 M    (2.5.17)
Comparing with (2.5.13) and (2.5.14), we see that
C_T/R_T = ε_b/(N_0 ln 2)    (2.5.18)
ε_b/N_0 is called the bit energy-to-noise density ratio. Thus, (2.5.16) and (2.5.18)
together imply that, with orthogonal signals, P_E decreases exponentially with T
for all ε_b/N_0 > ln 2.
Ultimately, the most important consequence of (2.5.16) is that, by letting T,
and hence M, become asymptotically large, we can make P_E arbitrarily
small for all transmission rates R_T < C_T (or, in this case, for all ε_b/N_0 > ln 2).
Again, this is a fundamental result applicable to all channels. However, making T
very large may be prohibitive in system complexity. In fact, as will be shown in the
next section, this is always the case for orthogonal signals. The major part of this
book deals with the problem of finding signal sets, or codes, and decoding
techniques for which system complexity remains manageable as T and M increase.
2.6 BANDWIDTH CONSTRAINTS, INTERSYMBOL
INTERFERENCE, AND TRACKING UNCERTAINTY
Up to this point, the only constraint we have imposed on the signal set was the
fundamental one of finite energy. Almost as important is the dimensionality con-
straint imposed by bandwidth requirements. The only limitation on dimensional-
ity discussed thus far was the one inherent in the fact that M signals defined over
a T-second interval can be represented using no more than M orthogonal basis
functions, or dimensions, as established by the Gram-Schmidt theorem (App. 2A).
These orthogonal functions (or signal sets) can take on an infinite multitude of
forms. Four of the most common are given in Table 2.1. The orthonormal relation
(2.1.2) can be verified in each case. An obvious advantage of the orthonormal set
of Example 1 is that, as contrasted with the general modulator and demodulator
of Fig. 2.1, only a single modulator and demodulator element need be imple-
mented, for this can be time-shared among the N dimensions, as shown in Fig. 2.8.
The observables {y,} then appear serially as sampled outputs of the integrator.
These are generated by a device which integrates over each symbol period of
duration T/N, is sampled, dumps its contents, and then proceeds to integrate over
the next symbol period, etc. The orthonormal set of Example 2 requires two
Table 2.1 Examples of orthonormal signal sets
1. Time-orthogonal functions (ω_0 a multiple of πN/T); each φ_n(t) is a sinusoid confined to (n − 1)T/N ≤ t ≤ nT/N and zero otherwise
2. Time-orthogonal quadrature-phase functions (ω_0 a multiple of 2πN/T); sine and cosine pairs on the same subintervals
3. Frequency-orthogonal functions (ω_0 a multiple of π/T); sinusoids on 0 ≤ t ≤ T at consecutive frequency multiples of π/T
4. Frequency-orthogonal quadrature-phase functions (ω_0 a multiple of 2π/T); sine and cosine pairs at consecutive frequency multiples of 2π/T
Figure 2.8 Modulator and demodulator for time-orthogonal functions: (a) modulator during nth subinterval; (b) demodulator during nth subinterval, which multiplies y(t) by φ_n(t) and integrates over (n − 1)T/N ≤ t ≤ nT/N, n = 1, 2, ..., N, to produce y_n.
modulator—demodulator elements, as shown in Fig. 2.9, which are generally called
quadrature modulator-demodulators. On the other hand, 3 and 4 would seem to
require a full bank of N demodulating elements.*°
It is well known that the maximum number of orthogonal dimensions trans-
mittable in time T over a channel of bandwidth W is approximately
N ~2WT (2.6.1)
The approximation comes about because of the freedom in the definition and
interpretation of bandwidth. To illustrate, we begin by giving a simplistic inter-
pretation of bandwidth. Suppose all communication channels on all frequency
bands are operating on a common time scale and using a common set of orthog-
onal signals, such as the frequency-orthogonal functions of Example 3. Then,
depending on its requirements, a channel would be assigned a given number N of
basis functions which are sinusoids at consecutive frequency multiples of π/T
‘© In fact, there exist both analog and digital techniques for implementing the entire bank with a
single serial processing device (Darlington [1964], Oppenheim and Schafer [1975)).
Figure 2.9 Demodulator for time-orthogonal quadrature-phase functions: the received waveform is multiplied by φ_2n(t) and by φ_2n+1(t), and each product is integrated over (n − 1)T/N ≤ t ≤ nT/N, n = 1, 2, ..., N, to produce the observables y_2n and y_2n+1.
radians per second. Given that these were processed ideally by the demodulator of
Fig. 2.1b, all other channels would have no effect on the given channel’s perfor-
mance since the basis functions of the other channels are orthogonal to those
assigned to the given channel and consequently would add zero components to
the integrator outputs y_1, y_2, ..., y_N of the demodulator of Figure 2.1b. Now
suppose we defined the channel bandwidth occupancy as the minimum frequency
separation between the basis functions of the Example 3 signal set times the
number of functions utilized by the channel. Then since the former is π/T radians
per second or 1/(2T) Hz, for a number of dimensions N, the bandwidth occupancy
W in Hz is given exactly by (2.6.1). We note also that, if the frequency separation
were any less, the functions would not be orthogonal.
The same argument can be made for the time-orthogonal functions of
Example 1 provided we take ω_0 to be a multiple of πN/T. Then it is readily verified
that, where the waveforms of any two channels overlap for a time interval T/N,
they are orthogonal over this interval and consequently the demodulator of one
channel is unaffected by the signals of the other. Thus the separation between
channels is exactly πN/T radians per second or N/2T Hz, again verifying (2.6.1).
In Examples 2 and 4 two phases (sine and cosine) are used for each frequency, but
as a result, consecutive frequencies must be spaced twice as far apart; hence the
bandwidth occupancy is the same as for 1 and 3.
The weakness in the above arguments, aside from the obvious impossibility of
regulating all channels to adopt a common modulation system with identical
timing, is that, inherent in the transmitter, receiver, and transmission medium,
there is a linear distortion which causes some frequencies to be attenuated more
than others. This leads to the necessity of defining bandwidth in terms of the signal
spectrum. The spectral density of the transmission just described is actually non-
zero for all frequencies, although its envelope decreases in proportion to the
frequency separation from ω_0. This, in fact, is a property of all time-limited
signals.
On the other hand, we may adopt another simplistic viewpoint, dual to the
above, and require that all our signals be strictly bandwidth-limited in the sense
that their spectral density is identically zero outside a bandwidth of W Hz. Then,
according to the classical sampling theorem, any signal or sequence of signals
satisfying this constraint can be represented as
Σ_n √2 {sin [πW(t − n/W)]/[πW(t − n/W)]} (a_n sin ω_0 t + b_n cos ω_0 t)    (2.6.2)
This suggests then that any subset of the set of band-limited functions
φ_2n(t) = √(2W) {sin [πW(t − n/W)]/[πW(t − n/W)]} sin ω_0 t
φ_2n+1(t) = √(2W) {sin [πW(t − n/W)]/[πW(t − n/W)]} cos ω_0 t    n any integer    (2.6.3)
Figure 2.10 Envelope of φ_2n(t) and φ_2n+1(t) of (2.6.3).
can be used as the basis functions for our transmission set. It is readily verified
that the functions are orthonormal over the doubly infinite interval, i.e., that
∫_{−∞}^∞ φ_n(t) φ_l(t) dt = δ_nl
As shown in Fig. 2.10, the envelope of both φ_2n(t) and φ_2n+1(t) reaches its peak at
t = n/W and has nulls at all other multiples of 1/W seconds. Furthermore, the
functions (2.6.3) can be regarded as the band-limited duals of the time-orthogonal
quadrature-phase orthonormal functions of Example 2, where we have exchanged
finite time and infinite bandwidth for finite bandwidth and infinite time. Another
interesting feature of this set of band-limited basis functions is that the demodula-
tor can be implemented as a pair of ideal bandpass filters (or quadrature multi-
pliers and ideal lowpass filters), sampled every 1/W seconds, producing at
t = n/W the two observables y,,, and y>,,, (see Fig. 2.11; also Prob. 2.6). Thus
again it appears that we can transmit in this way 2W dimensions per second so
that, as T— oo where we can ignore the slight excess time-width of the basis
functions, (2.6.1) is again satisfied.
Figure 2.11 Demodulator for functions of Eqs. (2.6.3): quadrature multipliers √2 sin ω_0 t and √2 cos ω_0 t, followed by ideal lowpass filters with bandwidth W/2, sampled at t = n/W to produce y_2n and y_2n+1.
On the basis of (2.6.1) we may draw a conclusion about the practicality of the
orthogonal signal set whose performance was analyzed in Sec. 2.5. There we found
that the error probability decreases exponentially with the product TE(R,), but
it follows from (2.5.14) that the number of signals, and therefore orthogonal
dimensions, is N = M = e^{TR_T}. Consequently, according to (2.6.1), we find that, for
orthogonal signals,
W ≈ e^{TR_T}/(2T)    (2.6.4)
which implies that, for all Ry; > C7/4, the bandwidth grows more rapidly with T
than the inverse error probability. This exponential bandwidth growth is a severe
handicap to the utilization of such signal sets. We shall find, however, in the next
chapter that there exist codes or signal sets whose dimensionality grows only
linearly with T and yet which perform nearly as well as the orthogonal set.
The impossibility of generating functions which are both time-limited and
band-limited has led to many approaches to a compromise (Slepian and Pollack,
[1961], Landau and Pollack [1961, 1962]). In terms of the previous discussions, we
may generalize on the time-orthogonal functions of Table 2.1 (Examples 1 and 2)
by multiplying all the functions in question by an envelope function f(t − nT/N)
with the property that
∫_{−∞}^∞ f(t − nT/N) f(t − lT/N) dt = δ_nl    (2.6.5)
to obtain
φ_n(t) = √2 f(t − nT/N) sin ω_0 t    (2.6.6)
and
φ_2n(t) = √2 f(t − nT/N) sin ω_0 t
φ_2n+1(t) = √2 f(t − nT/N) cos ω_0 t    (2.6.7)
Equation (2.6.7) includes as a special case the band-limited example of (2.6.3)
where the envelope function is taken to be
f(t) = √W sin (πWt)/(πWt)    (with T/N = 1/W)    (2.6.8)
Typically, however, f(t) is chosen to be time-limited, though not necessarily to
T/N seconds, and, though of infinite frequency duration, its spectrum decreases
much more rapidly than 1/W.
The choice of envelope function, also called the spectrum shaping function, is
not made on the basis of signal spectrum alone. For bandwidth is never an end
unto itself; rather, the goal is to minimize interference and linear distortion in-
troduced by the channel. Thus, even if f(t) is the ideal band-limited function of
(2.6.8) and the demodulator contains ideal lowpass filters (as shown in Fig. 2.11),
the transmitter, transmitting media, and receiver introduce other (non-ideal)
linear filtering characteristics which distort the waveform, so that the signal com-
ponent of the received waveform is no longer exactly f(t). As a result, we no longer
have the orthogonality condition (2.6.5) among the signals for successive dimen-
sions and the demodulator output for a given dimension is influenced by the
signal component of adjacent dimensions. This phenomenon is called intersymbol
interference. The degree of this effect depends on the bandwidth of the filters, or
linearly distorting elements, in the transmitter, receiver, and medium. Only when
the bandwidth of these distorting filters is on the order of that of f(t) does this
become a serious problem. In such cases, of which data communication over
analog telephone lines is a prime example, spectrum shaping functions are chosen
very carefully to minimize the intersymbol interference. Also, with intersymbol
interference present, the demodulator of Fig. 2.1b is no longer optimum because
of the nonorthogonality of signals for successive dimensions. Optimum demodu-
lation for such channels, which has been studied extensively (Lucky, Salz,
Weldon [1968], Forney [1972], Omura [1971]), leads to nonindependent observ-
ables. In this chapter and the next we shall avoid the problem of intersymbol
interference, by assuming a sufficiently wideband channel. In Chap. 4, we return
to this issue and treat the problem as a natural extension of decoding techniques
developed in that chapter.
Additional sources of imperfection arise because of uncertainties in tracking
carrier frequency and phase, and symbol timing. For the time-orthogonal func-
tions (Example 1 of Table 2.1), uncertainty in phase or frequency will cause the
demodulator to attenuate the signal component of the output. For example, if the
frequency error is Aw and the phase error is @, the attenuation factor is easily
shown to be approximately
cos φ [sin (TΔω/N)/(TΔω/N)]
provided we take TΔω/N ≪ 1 and Tω_0/N ≫ 1 (see Prob. 2.7). For time-
orthogonal quadrature-phase functions (Example 2 of Table 2.1), the situation is
aggravated by the fact that incorrect phase causes intersymbol interference be-
tween the two dimensions which share a common frequency. For with a phase
error φ, the signal component of y_2n is proportional to x_2n cos φ + x_2n+1 sin φ,
while that of y_2n+1 is proportional to x_2n+1 cos φ − x_2n sin φ (see Prob. 2.7).
Finally, symbol time uncertainty will cause adjacent symbol overlap during
presumed symbol times and hence intersymbol interference. The influence of all
these imperfections on demodulation and decoding has been treated in the appli-
cations literature (Jacobs [1967], Heller and Jacobs [1971]).
2.7 CHANNEL INPUT CONSTRAINTS
The last section treated the causes of performance degradation which arise in the
channel, comprising the modulator, transmitter, medium, receiver and demodula-
tor (Fig. 2.1). These imperfections and constraints are inherent in the continuous
or analog components of the channel which, as we noted, are not easily control-
lable. In contrast, we now consider constraints on the channel inputs and outputs
imposed by limitations in the encoder and decoder. Such limitations, which may
lead to suboptimal operation, are imposed whenever the encoder and decoder are
implemented digitally. In most cases, they produce a very small degradation
which can be very accurately predicted and controlled.
A digital implementation of the encoder requires that the encoder output
symbols {x,,,,} be elements of a finite alphabet. The most common and simplest
code alphabet is binary. For an AWGN channel with binary inputs, for any m and
n, the choice x_mn = ±√ε_s (where ε_s is the energy per channel symbol) guarantees
a constant-energy transmitted signal. The binary choice can be implemented either
by amplitude modulation (plus or minus amplitude), or by phase modulation (0°
or 180° phases) of any of the basis function sets discussed in the last section. When
used with time-orthogonal functions (Table 2.1, Example 1), this is usually referred
to as biphase modulation; when used with time-orthogonal quadrature-phase
functions (Table 2.1, Example 2), this is usually called quadriphase modulation.
The reason for the latter term is that two successive encoded symbols generate
the modulator output signal in the single interval (n — 1)T/N <t < nT/N, that is,
x_2n φ_2n(t) + x_2n+1 φ_2n+1(t) = √(2ε_s N/T)(± sin ω_0 t ± cos ω_0 t)
                              = 2√(ε_s N/T) sin (ω_0 t + θ_k)
where θ_k = π/4 + kπ/2, k = 0, 1, 2, or 3.
Note that this results in twice the symbol energy of biphase modulation, but it
is spread out over twice the time, since two code symbols are transmitted ; hence the
signal energy per symbol and consequently the power is the same. We note also
that, as shown in Sec. 2.2, the demodulator outputs are the same in both cases,
and consequently the performance is identical.'?
An obvious disadvantage of a binary-code symbol alphabet is that it limits the
number of messages which can be transmitted with N dimensions, or channel
symbols, to M ≤ 2^N and hence constrains the transmission rate to
R_T ≤ (N/T) ln 2 nats/s. We may remove this limitation by increasing
! Provided of course the phase tracking errors are negligible; otherwise the intersymbol interfer-
ence from the quadrature component, as discussed in Sec. 2.6, can degrade performance relative to the
biphase case.
the code alphabet size to any integer q, although, for efficient digital implementa-
tion reasons, q is usually taken to be a power of 2. Then M ≤ q^N and
R_T ≤ (N/T) ln q nats per second, which can be made as large as desired or as
permitted by the channel noise, as we shall find. As an aside, we note that M = N
orthogonal signals can always be implemented as biphase-modulated time-
orthogonal basis functions, whenever N is a multiple of 4 (see Prob. 2.5 for
N = 2^K, K ≥ 2).
For the time-orthogonal waveforms 1 or 2 of Table 2.1, the modulator for
g-ary code symbols is commonly implemented as a multiple amplitude modulator.
For example, with a four-symbol alphabet, the modulator input symbols might be
chosen as {a_1, a_2, a_3, a_4}. With equiprobable symbols and amplitudes ±a and
±3a, the average symbol energy is ε_s = 5a². Of course, a disadvantage
is that the transmitted power is no longer constant. A remedy for this is to use
multiphase rather than multiamplitude modulation. This is easily conceived as a
generalization of the frequency-orthogonal quadrature-phase basis set of Table
2.1, Example 4. A 16-phase modulation system would transmit a symbol from a 16-
symbol alphabet as
√ε_s cos (2πk/L) φ_2n(t) + √ε_s sin (2πk/L) φ_2n+1(t)
that is, as a sinusoid of energy ε_s and phase 2πk/L on a quadrature pair of basis functions sharing a common frequency,
where k = 0, 1, 2, ..., 15 and L= 16. We note, however, that this requires two
dimensions per symbol so that, in terms of bandwidth or dimensionality, this
16-symbol code alphabet simultaneously modulating two dimensions of a time-
orthogonal quadrature-phase system is equivalent to the four-symbol alphabet
amplitude modulating one dimension at a time. The signal geometry of the two
systems for equal average symbol energy ε_s is shown in Fig. 2.12. It is easily
Figure 2.12 Two examples of 16 signals in two dimensions: (a) multiamplitude signal set in two dimensions (amplitudes ±a, ±3a on each axis); (b) multiphase signal set in two dimensions (16 equally spaced phases on a circle).
shown (Prob. 2.8) that, for equal ε_s, the amplitude modulation system outper-
forms the phase modulation system, but the latter has the advantage of constant
energy. Obviously, we could generalize to signals on a three-dimensional sphere
or higher, but for both practical and theoretical reasons to be discussed in the next
chapter, this is not profitable.
2.8 CHANNEL OUTPUT QUANTIZATION: DISCRETE
MEMORYLESS CHANNELS
We now turn to limitations imposed by the digital implementation of the decoder.
Considering first the AWGN optimum decoder (Fig. 2.4), we note the obvious
incentive to implement digitally the discrete inner-product calculations (x_m, y) =
Σ_{n=1}^N x_mn y_n. While the input symbols {x_mn} are normally elements of a finite set as
discussed in the last section, the outputs {y,} are continuous Gaussian variables
and consequently must be quantized to a finite number of levels if digital multipli-
cations and additions are to be performed. An example of an eight-level uniform
quantizer is shown in Fig. 2.13. Uniform quantizers are most commonly
employed, although nonuniform quantization levels may improve performance
to a slight degree.
The performance of a quantized, and hence suboptimum, version of the opti-
mum decoder of Fig. 2.4 is difficult to analyze precisely. On the other hand,
quantization of the output to one of J levels simply transforms the AWGN
channel to a finite-input, finite-output alphabet channel. An example of a biphase
modulated AWGN channel with output quantized to eight levels is shown in
Fig. 2.14. Denoting the binary input alphabet by {a_1, a_2} where a_1 = −a_2 = √ε_s
and denoting the output alphabet by {b_1, b_2, ..., b_8}, we can completely describe
Figure 2.13 Uniform eight-level quantizer: input thresholds at 0, ±a, ±2a, ±3a; output levels b_1, b_2, ..., b_8.
Figure 2.14 Quantized demodulator and channel model: (a) quantized demodulator for binary PSK signals, in which y(t) = √(2ε_s/T) sin ω_0 t + n(t) is multiplied by √(2/T) sin ω_0 t, integrated, sampled every T seconds, and passed through the uniform eight-level quantizer to produce y_1, y_2, ..., y_N; (b) quantized channel model with inputs a_1 = +√ε_s, a_2 = −√ε_s, outputs b_1, ..., b_8, and transition probabilities p(b_j | a_k).
the channel by the conditional probabilities or likelihood functions
p_N(y | x_m) = ∏_{n=1}^N p(y_n | x_mn)    m = 1, 2, ..., M
where for each m and n
p(y_n = b_j | x_mn = a_k) = ∫_{B_j} [e^{−(z − a_k)²/N_0}/√(πN_0)] dz    k = 1, 2;  j = 1, 2, ..., 8    (2.8.1)
and B; is the jth quantization interval. We note that, while a, can actually be
associated with the numerical value of the signal amplitude, b; is an abstract
symbol. Although we could associate with b; the value of the midpoint of the
interval, there is nothing gained by doing this. More significant are the facts that
the vector likelihood function can be written as the product of symbol conditional
probabilities and that all symbols are identically distributed. In this case, of
course, this is just a consequence of the AWGN channel for which individual
observables (demodulator outputs prior to quantization) are independent. A
channel satisfying these conditions is called memoryless, and when its input and
output alphabets are finite it is called a discrete memoryless channel (DMC) (cf.
Sec. 1.2). Other examples of discrete memoryless channels, derived from physical
channels other than the AWGN channel, will be treated in Sec. 2.12. Figure 2.14b
completely describes the DMC just considered in terms of its binary-input, octal-
output, conditional probability distribution, sometimes called the channel transi-
Figure 2.15 Binary-symmetric channel: inputs a_1, a_2; outputs b_1, b_2; crossover probability p.
tion distribution. Clearly, this distribution, and consequently the decoder
performance, depends on the location of the quantization levels, which in turn
must depend on the signal level and noise variance. Thus, to implement an effec-
tive multilevel quantizer, a demodulator must incorporate automatic gain control
(AGC).
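A minimal Python sketch of the transition probabilities (2.8.1) for a binary-input AWGN channel with a uniform eight-level quantizer; the function names, threshold spacing, and numerical values are illustrative assumptions, and the demodulator output is modeled as Gaussian with mean ±√ε_s and variance N_0/2 as in the text.

```python
import math

def normal_cdf(x, mean, sigma):
    """Cumulative distribution of a Gaussian with the given mean and std dev."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))

def quantized_channel(Es, N0, step):
    """Transition probabilities (2.8.1) for a binary-input AWGN channel with a
    uniform eight-level quantizer (thresholds at 0, +-step, +-2*step, +-3*step).
    Returns p[k][j] = p(b_j | a_k) with a_1 = +sqrt(Es), a_2 = -sqrt(Es)."""
    sigma = math.sqrt(N0 / 2.0)                  # std dev of the unquantized observable
    thresholds = [-math.inf, -3*step, -2*step, -step, 0.0,
                  step, 2*step, 3*step, math.inf]
    p = []
    for a in (+math.sqrt(Es), -math.sqrt(Es)):
        row = [normal_cdf(thresholds[j + 1], a, sigma) -
               normal_cdf(thresholds[j], a, sigma) for j in range(8)]
        p.append(row)
    return p

for row in quantized_channel(Es=1.0, N0=1.0, step=0.5):
    print([round(x, 4) for x in row])
# Each row sums to 1; the hard-quantized (BSC) crossover probability is the total
# mass of the four intervals lying on the wrong side of zero.
```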
The simplest DMC is the one with binary input and output symbols, which
may be derived from a binary-input AWGN channel by utilizing a two-level
quantizer. The quantizer, whose output is b, for nonnegative inputs and b, other-
wise, is generally called a hard quantizer (or limiter), in contrast with a multilevel
quantizer which is usually called a soft quantizer. The resulting hard-quantized
output channel is the binary-symmetric channel (BSC). When derived from the
AWGN channel, the BSC has the conditional distribution diagram shown in Fig.
2.15 with p = p{y = b2|x = a,} = p{y = b, |x = a}}, generally called the crossover
probability, being the same as the symbol error probability for an uncoded
digital communication system. The principal advantage of hard quantizing the
AWGN channel into a BSC is that no knowledge is needed of the signal energy.
In contrast, as commented above, the soft quantizer requires this information
and hence must employ AGC. On the other hand, as will be elaborated on in
Sec. 2.11 and the next chapter, the hard quantizer considerably degrades per-
formance relative to a properly adjusted soft quantizer.
With quadriphase modulation, demodulated by the system of Fig. 2.9, the
same quantization scheme can be used on each of the two streams of observables,
resulting in exactly the same channel as with biphase modulation, provided we
can ignore the quadrature intersymbol interference discussed in Sec. 2.6. Multi-
amplitude modulation can be treated in the same way as two-level modulation.
For the case of Q input levels and J-level output quantization, the AWGN
channel is reduced to a Q-input, J-output DMC. With Q-phase multiphase modu-
lation employing both quadrature dimensions, as shown in case b of Fig. 2.12, the
quantization may be more conveniently implemented in phase rather than
amplitude.
Once the AWGN channel has been reduced to a DMC by output quantiza-
tion, the decoder of Fig. 2.4, or its digital equivalent operating on quantized data,
is no longer optimum. Rather, the optimum decoder must implement the decision
rule (2.2.7) which is optimum for the resulting memoryless channel. For
equiprobable messages, this reduces to the maximum likelihood decoder or
decision rule
Ĥ = H_m    if Σ_{n=1}^N [ln p(y_n | x_mn) − ln p(y_n | x_m′n)] ≥ 0    for all m′ ≠ m    (2.8.2)
where for each m and n
x_mn ∈ {a_1, a_2, ..., a_Q}
y_n ∈ {b_1, b_2, ..., b_J}
For the BSC, (2.8.2) reduces to an even simpler rule. For, as shown in Fig. 2.15,
the conditional probability for the nth symbol is p if y_n ≠ x_mn, and is (1 − p) if
y_n = x_mn. Suppose that the received vector y = (y_1, ..., y_N) differs from a trans-
mitted vector x_m = (x_m1, ..., x_mN) in exactly d_m positions. The number d_m is then
said to be the Hamming distance between vectors x,, and y. The conditional
probability of receiving y given that x,, was transmitted is
p_N(y | x_m) = ∏_{n=1}^N p(y_n | x_mn) = p^{d_m}(1 − p)^{N − d_m}    (2.8.3)
Note that, because of the symmetry of the channel, this likelihood function does
not depend on the particular value of the transmitted symbol, but only on whether
or not the channel caused a transition from a, to b, or from a, to b,. Thus
ln p_N(y | x_m) = Σ_{n=1}^N ln p(y_n | x_mn)
              = N ln (1 − p) − d_m ln [(1 − p)/p]    (2.8.4)
Substituting (2.8.4) into (2.8.2), we obtain for the BSC the rule
Ĥ = H_m    if (d_m′ − d_m) ln [(1 − p)/p] ≥ 0    for all m′ ≠ m
Without loss of generality, we may assume p < ½ (for if this is not the case, we may
make it such by just interchanging the indices on b, and b,). Then the decoding
rule becomes
Ĥ = H_m    if d_m ≤ d_m′ for all m′ ≠ m    (2.8.5)
where d_m is the Hamming distance between x_m and y. In each case, ties are
assumed to be resolved randomly as before.
Hence, we conclude that, for the BSC, the maximum likelihood decoder re-
duces to a minimum distance decoder wherein the received vector y is compared
with each possible transmitted signal vector and the one closest to y, in the sense
of minimum number of differing symbols (Hamming distance), is chosen as the
correct transmitted vector. Although this suggests a much simpler mechanization,
this rule could be implemented as in Fig. 2.4 if we took y and x,, to be binary
vectors and a_1 = b_1 = +1 and a_2 = b_2 = −1.
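A minimal sketch of the minimum-distance rule (2.8.5) follows; the function names and the two-codeword example are illustrative, and ties are broken here by lowest index rather than randomly as in the text.

```python
def hamming_distance(u, v):
    """Number of positions in which two binary tuples differ."""
    return sum(a != b for a, b in zip(u, v))

def min_distance_decode(y, codewords):
    """Maximum likelihood decoding for the BSC with p < 1/2, rule (2.8.5):
    choose the codeword at minimum Hamming distance from y."""
    return min(range(len(codewords)),
               key=lambda m: hamming_distance(y, codewords[m]))

# Two codewords of a length-5 code at Hamming distance 4; a single channel
# error is corrected.
codewords = [(0, 0, 0, 0, 0), (1, 1, 0, 1, 1)]
print(min_distance_decode((0, 1, 0, 0, 0), codewords))   # -> 0
print(min_distance_decode((1, 1, 0, 0, 1), codewords))   # -> 1
```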
For discrete memoryless channels other than the BSC, the decoding rule
(2.8.2) can be somewhat simplified in many cases (see Prob. 2.9), but usually not
to the point of being independent of the transition probabilities as has just been
shown for the BSC. Generally, the rule will depend on these probabilities and
hence on the energy-to-noise ratios as well as on the quantization scheme used.
This leads to a potential decoder mismatch or suboptimality due to unknown
signal levels (incorrect AGC) or noise variance. Also, since the transition probabil-
ities are themselves real numbers, quantization of these is required to implement
the rule of (2.8.2) digitally with a resulting minor degradation. As we shall discover
in later chapters, some decoders are relatively insensitive to channel statistics,
while others degrade rapidly as a function of decoder mismatch. However, it is
generally true that, for binary inputs, even with a mismatched decoder, perfor-
mance of a multilevel (soft) quantized channel decoder is superior to that of a
two-level (hard) quantized channel decoder. In performance evaluation of binary-
input channels with variable output quantization, we shall generally treat the
limiting cases of an AWGN channel without quantization and with hard quanti-
zation (BSC) to establish the two limits. Some intermediate cases will also be
treated to indicate the rate of approach of multilevel (soft) quantization to the
unquantized ideal case.
2.9 LINEAR CODES
Thus far we have devoted considerable attention to all parts of the communica-
tion system except the encoder. In its crudest form, encoding can be regarded as a
table-look-up operation; each of the M signal vectors x_1, x_2, ..., x_M is stored in an
N-stage register of a memory bank and, whenever message H,, is to be trans-
mitted, the corresponding signal vector x,, is read into the modulator. Alterna-
tively, we may label each of the M = q^K messages, as in Fig. 2.1, by a K-vector
over a q-ary alphabet. Then the encoding becomes a one-to-one mapping from the
set of message vectors {u_m = (u_m1, ..., u_mK)} into the set of signal vectors
{x_m = (x_m1, ..., x_mN)}. We shall concern ourselves primarily with binary alphabets;
thus initially we take u_mn ∈ {0, 1} for all m, n and generalize later to q > 2. A
particularly convenient mapping to implement is a linear code. For binary-input
data, a linear code consists simply of a set of modulo-2 linear combinations of the
data symbols, which may be implemented as shown in Fig. 2.16. The K-stage
register corresponds precisely to the data block register in the general system
diagram of Fig. 2.1. The coder then consists of L modulo-2 adders, each of which
adds together a subset of the data symbols u_m1, u_m2, ..., u_mK to generate one code
symbol v_mn, where n = 1, 2, ..., L, as shown in Fig. 2.16. We shall refer to the vector
v_m = (v_m1, v_m2, ..., v_mL) as the code vector. Modulo-2 addition of binary symbols
will be denoted by @ and is defined by
0@1=1@0=1
0@0=1@1=0 (2.9.1)
It is readily verified by exhaustive testing that this operation is associative and
commutative; that is, if a, b, c are binary symbols (0 or 1), then
(a®@b)@c=a@(b@c) (2.9.2a)
Figure 2.16 Linear block encoder: a K-stage data register feeding L modulo-2 adders, followed by the binary-symbol to channel-symbol mapping that produces x_m.
and
a®b=b@a (2.9.2b)
Thus the first stage of the linear coding operation for binary data can be repre-
sented by
v_m1 = u_m1 g_11 ⊕ u_m2 g_21 ⊕ ⋯ ⊕ u_mK g_K1
v_m2 = u_m1 g_12 ⊕ u_m2 g_22 ⊕ ⋯ ⊕ u_mK g_K2
  ⋮
v_mL = u_m1 g_1L ⊕ u_m2 g_2L ⊕ ⋯ ⊕ u_mK g_KL    (2.9.3)
where g_kn ∈ {0, 1} for all k, n. The term u_mk g_kn is an ordinary multiplication, so that
u_mk enters into the particular combination for v_mn if and only if g_kn = 1. The matrix
G = [ g_11  g_12  ⋯  g_1L ]   [ g_1 ]
    [ g_21  g_22  ⋯  g_2L ] = [ g_2 ]    (2.9.4a)
    [  ⋮                  ]   [  ⋮  ]
    [ g_K1  g_K2  ⋯  g_KL ]   [ g_K ]
is called the generator matrix of the linear code and {g_k} are its row vectors. Thus,
(2.9.3) can be expressed in vector form as
v_m = u_m G = Σ_{k=1}^K u_mk g_k    (2.9.4b)
‘1 '§ means modulo-2 addition.
where both u,, and v,, are binary row vectors. Note that the set of all possible
codewords is the space spanned by the row vectors of G. The rows of G then form
a basis with the information bits being the basis coefficients for the codeword.
Since the basis vectors are not unique for any linear space, it is clear that there are
many generator matrices that give the same set of codewords.
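A small Python sketch of the encoding (2.9.4b) over GF(2) follows; the generator matrix shown is a hypothetical (L = 5, K = 3) example, not a code from the text.

```python
import itertools

def encode(u, G):
    """Modulo-2 encoding (2.9.4b): v_m = u_m G, all arithmetic over GF(2)."""
    L = len(G[0])
    return tuple(sum(u_k * g_k[n] for u_k, g_k in zip(u, G)) % 2
                 for n in range(L))

# A hypothetical (L = 5, K = 3) generator matrix, for illustration only.
G = [(1, 0, 0, 1, 1),
     (0, 1, 0, 1, 0),
     (0, 0, 1, 0, 1)]

for u in itertools.product((0, 1), repeat=3):
    print(u, "->", encode(u, G))
# Closure: the termwise modulo-2 sum of any two codewords is again a codeword.
```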
To complete the linear encoding, we must convert the L-dimensional code
vector v_m with elements in {0, 1} into the N-dimensional real-number signal vector
x_m = (x_m1, x_m2, ..., x_mN). For the simplest cases of biphase or quadriphase modu-
lation, we need only the one-dimensional mapping
v_mn = 0 → x_mn = +√ε_s
v_mn = 1 → x_mn = −√ε_s
so that, in fact, L= N. For more elaborate modulation schemes, we must take
L> N. For example, for the four-level amplitude modulation scheme of Fig. 2.12a
we need to take L = 2N. Then the four possible combinations of the pair
(v_ml, v_m,l+1) (where l is odd) give rise to one of four values for the signal (ampli-
tude) symbol x_mn [where n = (l + 1)/2]. Similarly, for the 16-phase modulation
scheme of Fig. 2.12b, we must take L = 4N and use four consecutive v-symbols to
select one of the 16-phase x-symbols.
Before considering the code or modulation space further, we shall examine an
extremely important property of linear codes known as closure: namely, the
property that the modulo-2 termwise sum of two code vectors v_m and v_k
v_m ⊕ v_k = (v_m1 ⊕ v_k1, v_m2 ⊕ v_k2, ..., v_mL ⊕ v_kL)    (2.9.5)
is also a code vector. This is easily shown, for, by applying the associative law to
v_m = u_m G and v_k = u_k G, we obtain
v_m ⊕ v_k = u_m G ⊕ u_k G
          = (u_m ⊕ u_k)G
But since u_m and u_k are two K-dimensional data vectors, their modulo-2 sum must
also be a data vector, for the 2^K data vectors must coincide with all possible binary
vectors of dimension K. Thus, denoting this data vector u_m ⊕ u_k = u_i, it follows
that
v_m ⊕ v_k = u_i G
          = v_i    (2.9.6)
which is, therefore, a code vector. We generally label the data vectors consecu-
tively with the convention that u_1 = (0, 0, ..., 0) = 0. It follows from (2.9.4) that
also v_1 = 0. The vector 0 is called the identity vector since, for any other code
vector,
v_m ⊕ v_1 = v_m ⊕ 0
          = v_m    (2.9.7)
We note also that as a consequence of (2.9.1)
v_m ⊕ v_m = 0    (2.9.8)
which means that every vector is its own negative (or additive inverse) under the
operation of modulo-2 addition. When a set satisfies the closure property (2.9.6),
the identity property (2.9.7), and the inverse property (2.9.8) under an operation
which is associative and commutative (2.9.2), it is called an Abelian group. Hence
linear codes are also called group codes. They are also called parity-check codes,
since the code symbol v_mn = 1 if the "parity" of the data symbols added to form
v_mn is odd, and v_mn = 0 if the parity is even.
An interesting consequence of the closure property is that the set of Hamming
distances from a given code vector to the (M — 1) other code vectors is the same
for all code vectors. To demonstrate this, it is convenient to define first the Ham-
ming weight of a binary vector as the number of ones in the vector. The Hamming
distance between two vectors v_m and v_m′ is then just the Hamming weight of their
modulo-2 termwise sum, denoted w(v_m ⊕ v_m′). For example, if the two code vectors
differ such that
v_m ⊕ v_m′ = (1 1 0 1 1)
then
w(v_m ⊕ v_m′) = w(1 1 0 1 1) = 4
which is clearly the number of differing positions and hence the Hamming dis-
tance between the vectors. Now the set of distances of the other code vectors from
v_1 = 0 is clearly {w(v_2), w(v_3), ..., w(v_M)}. On the other hand, the set of distances
from any code vector v_m ≠ 0 to the other code vectors is just {w(v_m ⊕ v_m′): all
m′ ≠ m}. But, by the closure property, v_m′ ⊕ v_m is some code vector other than v_1.
Furthermore, for any two distinct vectors v_m′, v_m″ where m′ ≠ m, m″ ≠ m, and
m′ ≠ m″ we have
v_m′ ⊕ v_m ≠ v_m″ ⊕ v_m
and
v_m′ ⊕ v_m ≠ 0 = v_1
Hence, as the index m′ varies over all code vectors other than m, the operation
v_m′ ⊕ v_m generates all (M − 1) distinct nonzero code vectors and consequently the
entire set except v_1. It follows that
{v_m′ ⊕ v_m: all m′ ≠ m} = {v_2, v_3, ..., v_M}    (2.9.9)
and thus also that
{w(v_m′ ⊕ v_m): all m′ ≠ m} = {w(v_2), w(v_3), ..., w(v_M)}    (2.9.10)
which means that the set of distances of all other code vectors from a given code
vector v,, is the same as the set of distances of all code vectors from v,. Thus,
without loss of generality, we may compute just the distances from v, or, equiva-
lently, the weights of all the nonzero code vectors.
Another very useful consequence of the closure property of linear codes is
that, when these are used on a binary-input, output-symmetric channel with maxi-
mum likelihood decoding, the error probability for the mth message is the same for
all m; that is,
P_{E_m} = P_E    m = 1, 2, ..., M    (2.9.11)
as we next show.
A binary-input symmetric channel, which includes the biphase and quadri-
phase AWGN channels as well as all symmetrically quantized reductions thereof,
can be defined as follows. Let, for each m and n,
p(y_n | x_mn = +√ε_s) = p(y_n | v_mn = 0) ≜ p_0(y_n)
p(y_n | x_mn = −√ε_s) = p(y_n | v_mn = 1) ≜ p_1(y_n)    (2.9.12)
This binary-input channel is said to be symmetric if
p_1(y) = p_0(−y)    (2.9.13)
It is easily verified that the AWGN channel, the BSC,¹² and any other symmet-
rically quantized AWGN channel all satisfy (2.9.13). To prove the uniform error
property (2.9.11) for a binary linear code on this class of channels using maximum
likelihood decoding, we note, using (2.3.1), (2.3.3), and (2.9.12), that
P_{E_m} = Pr {y ∈ Λ̄_m | x_m}
       = Σ_{y∈Λ̄_m} p_N(y | x_m)
where
Λ̄_m = {y: ln p_N(y | x_m′) ≥ ln p_N(y | x_m) for some m′ ≠ m}
    = {y: Σ_{n: v_m′n=1, v_mn=0} ln [p(y_n | v_m′n = 1)/p(y_n | v_mn = 0)]
        + Σ_{n: v_m′n=0, v_mn=1} ln [p(y_n | v_m′n = 0)/p(y_n | v_mn = 1)] ≥ 0 for some m′ ≠ m}
    = {y: Σ_{n: v_m′n=1, v_mn=0} ln [p_0(−y_n)/p_0(y_n)]
        + Σ_{n: v_m′n=0, v_mn=1} ln [p_0(y_n)/p_0(−y_n)] ≥ 0 for some m′ ≠ m}    (2.9.14a)
¹² For the BSC we must use the convention "0" → +1 and "1" → −1, so that y = +1 or −1, in
order to use the definition (2.9.13) of symmetry.
We have
P_{E_m} = Σ_{y∈Λ̄_m} ∏_{n: v_mn=0} p(y_n | v_mn = 0) ∏_{n: v_mn=1} p(y_n | v_mn = 1)
       = Σ_{y∈Λ̄_m} ∏_{n: v_mn=0} p_0(y_n) ∏_{n: v_mn=1} p_0(−y_n)    (2.9.15a)
But if we let
z_n = y_n     for all n such that v_mn = 0
z_n = −y_n    for all n such that v_mn = 1
which is just a change of dummy variables in the summation (or integration),
(2.9.15a) and (2.9.14a) become respectively
P_{E_m} = Σ_{z∈Λ̄_m} ∏_{n=1}^N p_0(z_n)    (2.9.15b)
where now
Λ̄_m = {z: Σ_{n: v_m′n ≠ v_mn} ln [p_0(−z_n)/p_0(z_n)] ≥ 0 for some m′ ≠ m}    (2.9.14b)
But comparing (2.9.14b) and (2.9.15b) with (2.9.14a) and (2.9.15a), respectively, with
m = 1 (v_1n = 0 for all n) in the latter pair, we find that, because of the symmetry of
the linear code (2.9.9) and the random resolution of ties (see Sec. 2.2),
P_{E_m} = P_{E_1}    for m = 2, 3, ..., M    (2.9.16)
Thus not only are all message error probabilities the same, but, in calculating P_E
for linear codes on binary-input symmetric channels, we may without loss of
generality assume always that the all-zeros code vector was transmitted. This
greatly reduces the effort and simplifies the computations.
As an example, consider a linearly coded biphase-modulated signal on the
AWGN channel. Although computation of the exact error probability is generally
prohibitively complicated (except for special cases like the orthogonal or simplex
codes; see Probs. 2.4, 2.5), the union upper bound of Sec. 2.3 can easily be cal-
culated if the set of weights of all code vectors is known. For from (2.3.4) and
(2.3.10), we obtain that, for the AWGN channel with biphase modulation
P_E = P_{E_1} ≤ Σ_{k=2}^M Q(‖x_k − x_1‖/√(2N_0))
where ‖x_k − x_1‖ is the Euclidean distance between signal vectors. Now suppose
the weight of v_k, which is also its Hamming distance from v_1, is w_k. This means
that w_k of the code symbols of v_k are ones and consequently that w_k of the code
symbols of x_k are −√ε_s (the remainder being +√ε_s), and of course all code
symbols of x_1 are +√ε_s since v_1 = 0. Thus
‖x_1 − x_k‖ = 2√(ε_s w_k)
and, consequently, for the biphase- (or quadriphase-) modulated AWGN channel
we have the union bound
P_E ≤ Σ_{k=2}^M Q(√(2ε_s w_k/N_0))    (2.9.17)
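A small Python sketch of the union bound (2.9.17) computed from a code's weight distribution; the weight list below is a hypothetical example, not a code analyzed in the text.

```python
import math

def Q(x):
    """Gaussian integral function: P(X > x) for X ~ N(0, 1)."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))

def union_bound_biphase(weights, Es_over_N0):
    """Union bound (2.9.17) for a linear code with biphase modulation on the
    AWGN channel: sum over nonzero codeword weights of Q(sqrt(2 Es w / N0))."""
    return sum(Q(math.sqrt(2.0 * Es_over_N0 * w)) for w in weights)

# Hypothetical weight distribution of the seven nonzero codewords of a small
# (5, 3) code, evaluated at Es/N0 = 2 (about 3 dB).
weights = [2, 2, 3, 3, 3, 3, 4]
print(union_bound_biphase(weights, Es_over_N0=2.0))
```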
We may readily generalize (2.9.17) to any binary-input symmetric channel by
using the Bhattacharyya bound (2.3.15) in conjunction with the union bound
(2.3.4). For memoryless channels, (2.3.15) becomes
P_E(1 → k) ≤ ∏_{n=1}^N Σ_y √[p(y | v_kn) p(y | v_1n = 0)]
           = ∏_{n: v_kn=0} Σ_y √[p(y | v_kn = 0) p(y | v_1n = 0)] ∏_{n: v_kn=1} Σ_y √[p(y | v_kn = 1) p(y | v_1n = 0)]
           = ∏_{n: v_kn ≠ v_1n} Σ_y √[p(y | v_kn = 1) p(y | v_1n = 0)]    (2.9.18)
where the last step follows since each sum in the first product equals unity. Since
v_kn ≠ v_1n in exactly w_k positions, we have
P_E(1 → k) ≤ [Σ_y √(p_0(y) p_1(y))]^{w_k}
and hence from (2.3.4) we find¹³
P_E ≤ Σ_{k=2}^M exp {−w_k [−ln Σ_y √(p_0(y) p_1(y))]}    (2.9.19)
where −ln Σ_y √(p_0(y) p_1(y)) is the previously defined Bhattacharyya distance,
which becomes ε_s/N_0 for the AWGN channel [see (2.3.17) and (2.9.17)]. We note
also that for the BSC
−ln Σ_y √(p_0(y) p_1(y)) = −ln √(4p(1 − p))
where p is the crossover probability. Tighter results for the BSC will be obtained
in the next section.
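A two-line Python check of the Bhattacharyya distance for the BSC, confirming that the general expression reduces to −ln √(4p(1 − p)); the function name is illustrative.

```python
import math

def bhattacharyya_distance(p0, p1):
    """-ln sum_y sqrt(p0(y) p1(y)) for two output distributions given as lists."""
    return -math.log(sum(math.sqrt(a * b) for a, b in zip(p0, p1)))

# BSC with crossover p: both evaluations below agree.
p = 0.05
print(bhattacharyya_distance([1 - p, p], [p, 1 - p]),
      -math.log(math.sqrt(4 * p * (1 - p))))
```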
In principle, we could employ the tighter Gallager bound of (2.4.8), but this
generally requires more knowledge of the code structure than just the set of
distances between code vectors. In fact, even the set of all code vector weights is
not easily calculated in general. Often, the only known parameter of a code is the
minimum distance between code vectors. Then from (2.9.17) and (2.9.19) we can
obtain the much weaker bound for the AWGN channel
P_E ≤ (M − 1) Q(√[(2ε_s/N_0) min_{k≠1} w_k])    (2.9.20)
¹³ This Bhattacharyya bound is also valid for asymmetric channels, but it is a weaker bound than
the Chernoff bound in such cases (see Prob. 2.10).
and for general binary-input channels
P_E ≤ (M − 1) exp {(min_{k≠1} w_k) ln Σ_y √(p_0(y) p_1(y))}    (2.9.21)
A seemingly unsurmountable weakness of this approach to the evaluation of
linear codes is that essentially all those long codes which can be elegantly
described or constructed with known distances have poor distance properties. A
few short codes, such as the Golay code to be treated in Sec. 2.11, are optimum for
relatively short block lengths and for some rates, and these are indeed useful to
demonstrate some of the advantages of coding. But a few scattered examples of
moderately short block codes hardly begin to scratch the surface of the remark-
able capabilities of coding, both linear and otherwise. In the next chapter we shall
demonstrate most of these capabilities by examining the entire ensemble of codes
of a given length and rate, rather than hopelessly searching for the optimum
member of this ensemble.
2.10 SYSTEMATIC LINEAR CODES AND OPTIMUM
DECODING FOR THE BSC*
In the last section, we defined a linear code as one whose code vectors are gen-
erated from the data vectors by the linear mapping
v_m = u_m G    m = 1, 2, ..., M    (2.10.1)
where G is an arbitrary K x L matrix of zeros and ones. We now demonstrate that
because any useful linear code is a one-to-one mapping from the data vectors to the
code vectors, it is equivalent to some linear code whose generator matrix is of the
form
G = [ 1 0 0 ⋯ 0  g_{1,K+1} ⋯ g_{1L} ]
    [ 0 1 0 ⋯ 0  g_{2,K+1} ⋯ g_{2L} ]
    [ 0 0 1 ⋯ 0  g_{3,K+1} ⋯ g_{3L} ]    (2.10.2)
    [          ⋮              ]
    [ 0 0 0 ⋯ 1  g_{K,K+1} ⋯ g_{KL} ]
We note first that a linear code (2.10.1) generated by the matrix (2.10.2) has its first
K code symbols identical to the data symbols, that is
v_mk = u_mk    k = 1, 2, ..., K    (2.10.3a)
and the remainder are as before¹⁴ given by
v_mn = Σ_{k=1}^K u_mk g_kn    n = K + 1, K + 2, ..., L    (2.10.3b)
* May be omitted without loss of continuity.
¹⁴ Σ means modulo-2 summation.
Such a code, which transmits the original K data symbols unchanged together
with L — K “parity-check ” symbols is called a systematic code.
Any one-to-one linear code is equivalent in performance to a systematic code,
as is shown by the following argument. Interchanging any two rows of G or
adding together modulo-2 any combination of rows does not alter the set of code
vectors generated; it simply relabels them. For, denoting the row vectors of $G$ as
$\mathbf{g}_1, \mathbf{g}_2, \ldots, \mathbf{g}_K$, we note that interchanging the two rows $\mathbf{g}_i$ and $\mathbf{g}_j$ changes the
original code vectors

$$\mathbf{v}_m = u_{m1}\mathbf{g}_1 \oplus \cdots \oplus u_{mi}\mathbf{g}_i \oplus \cdots \oplus u_{mj}\mathbf{g}_j \oplus \cdots \oplus u_{mK}\mathbf{g}_K$$

into the new code vectors

$$\mathbf{v}_m' = u_{m1}\mathbf{g}_1 \oplus \cdots \oplus u_{mi}\mathbf{g}_j \oplus \cdots \oplus u_{mj}\mathbf{g}_i \oplus \cdots \oplus u_{mK}\mathbf{g}_K$$

But, since $u_{mi}$ and $u_{mj}$ take on all possible combinations of values, the set $\{\mathbf{v}_m'\}$ is
identical to the set $\{\mathbf{v}_m\}$ except for relabeling. Similarly, adding row $\mathbf{g}_j$ to row $\mathbf{g}_i$
changes the original set into the new set of code vectors

$$\mathbf{v}_m' = u_{m1}\mathbf{g}_1 \oplus \cdots \oplus u_{mi}(\mathbf{g}_i \oplus \mathbf{g}_j) \oplus \cdots \oplus u_{mj}\mathbf{g}_j \oplus \cdots \oplus u_{mK}\mathbf{g}_K = \mathbf{v}_m \oplus u_{mi}\mathbf{g}_j$$

But, since $u_{mi}\mathbf{g}_j$ is itself a code vector, as a consequence of the closure property
demonstrated in the last section, adding the same code vector to each of the
original code vectors again generates the original set in different order. Hence

$$\{\mathbf{v}_m'\} = \{\mathbf{v}_m\}$$
To complete the argument, we perform row additions and interchanges on the
generator matrix in the following order. Beginning with the first nonzero column j,
we take the first row with a one in the jth position, interchange its position with
the first row, and add it to all other rows containing ones in the jth position. This
ensures that the jth column of the reduced matrix has a one in only the first row.
We then proceed to the next nonzero column of the reduced matrix which has a
one in any of the last K — 1 rows, interchange rows so there is a one in the second
row, and add this second row to all rows (including possibly the first) with ones in
this position. After K such steps we are left either with K columns, each having a
one in a single different row, or with one or more zero rows at the bottom of the
matrix; the latter occurs when the original matrix had two or more linearly
dependent rows. In the latter case, the reduced generator matrix, and hence also
the original G, cannot generate $2^K$ different code vectors; hence the mapping is not
one-to-one and corresponds, therefore, to a poor code since two or more data
vectors produce the same code vector. In the first case, we might need to inter-
change column vectors in order to arrive at the generator matrix of (2.10.2). This
merely results in a reordering of the code symbols.¹⁵
¹⁵ This does not alter the performance on any binary-input memoryless channel; it might, however,
alter performance on a non-binary-input channel, for which each signal dimension depends on more
than one code symbol; this is not of interest here.
Thus, whenever the code-generator matrix has linearly independent rows and
nonzero columns, the code is equivalent, except for relabeling of code vectors and
possibly reordering of the columns, to a systematic code generated by (2.10.2).
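The row-reduction argument just given is Gaussian elimination over the binary field. The following Python sketch (added for illustration; it is not from the original text) reduces a binary generator matrix to the systematic form of (2.10.2), interchanging columns when necessary and reporting failure when the rows are linearly dependent.

    def to_systematic(G):
        # Reduce a binary generator matrix (list of rows of 0/1) to [I_K | P] by row
        # operations and, if needed, column interchanges; returns (G_sys, column_order)
        # or None when the rows are linearly dependent (a poor, non-one-to-one code).
        G = [row[:] for row in G]
        K, L = len(G), len(G[0])
        cols = list(range(L))
        for i in range(K):
            pivot = next((r for r in range(i, K) if G[r][i] == 1), None)
            if pivot is None:
                # no pivot in column i: swap in a later column that has a 1 in rows i..K-1
                j = next((c for c in range(i + 1, L) if any(G[r][c] for r in range(i, K))), None)
                if j is None:
                    return None
                for row in G:
                    row[i], row[j] = row[j], row[i]
                cols[i], cols[j] = cols[j], cols[i]
                pivot = next(r for r in range(i, K) if G[r][i] == 1)
            G[i], G[pivot] = G[pivot], G[i]          # row interchange
            for r in range(K):                       # clear column i in all other rows
                if r != i and G[r][i] == 1:
                    G[r] = [(a ^ b) for a, b in zip(G[r], G[i])]
        return G, cols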
We therefore restrict attention henceforth to systematic linear block codes,
and consider, in particular, their use on the BSC. We demonstrated in Sec. 2.8 that
maximum likelihood decoding of any binary code transmitted over the BSC is
equivalent to minimum distance decoding. That is,
$$\hat{H} = H_m \qquad \text{if } d_m(\mathbf{y}) < d_{m'}(\mathbf{y}) \text{ for all } m' \ne m \qquad (2.10.4)$$

with ties resolved randomly. If we take $y_n \in \{0, 1\}$ and $x_{mn} = v_{mn} \in \{0, 1\}$, the Ham-
ming distance is given by

$$d_m(\mathbf{y}) = w(\mathbf{x}_m \oplus \mathbf{y}) = w(\mathbf{v}_m \oplus \mathbf{y}) \qquad (2.10.5)$$
Also, since the code and signal symbols are the same here, L = N. Thus decoding
might be performed by taking the weight of the vector formed by adding
modulo-2 the received binary vector y to each possible code vector and deciding
in favor of the message whose code vector results in the lowest weight.
We now demonstrate a simpler table-look-up technique for decoding
systematic linear codes on the BSC. Substituting (2.10.3a) into the right side of
(2.10.3b) and adding $v_{mn}$ to both sides of the latter, we obtain

$$0 = v_{mn} \oplus \bigoplus_{k=1}^{K} v_{mk}\, g_{kn} \qquad n = K + 1, K + 2, \ldots, L$$

or, in vector form

$$\mathbf{0} = \mathbf{v}_m H^T \qquad (2.10.6a)$$

where $H^T$ is the $L \times (L - K)$ matrix

$$H^T = \begin{bmatrix} g_{1,K+1} & \cdots & g_{1,L} \\ g_{2,K+1} & \cdots & g_{2,L} \\ \vdots & & \vdots \\ g_{K,K+1} & \cdots & g_{K,L} \\ 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{bmatrix} \qquad (2.10.6b)$$

Its transpose, the matrix $H$, is called the parity-check matrix. Thus, from (2.10.6a),
we see that any code vector multiplied by $H^T$ yields the $\mathbf{0}$ vector; thus the code
vectors constitute the null space of the parity-check matrix. Now consider post-
multiplying any received vector $\mathbf{y}$ by $H^T$. The resulting $(L - K)$-dimensional binary
vector is called the syndrome of the received vector and is given by

$$\mathbf{s} = \mathbf{y} H^T \qquad (2.10.7)$$
This operation can be performed in exactly the same manner as the encoding
operation (2.9.3) or (2.9.4), except here we require an L-stage register and L — K
modulo-2 adders. Obviously, if no errors are made, y = v,, and consequently the
syndrome is zero. Now suppose that the BSC causes an arbitrary sequence of
errors $\mathbf{e} = (e_1, e_2, \ldots, e_L)$, where we adopt the convention that

$$e_n = \begin{cases} 1 & \text{if an error occurs in the } n\text{th symbol transmission} \\ 0 & \text{if no error occurs in the } n\text{th symbol transmission} \end{cases}$$

Then, if $\mathbf{v}_m$ is transmitted,

$$\mathbf{y} = \mathbf{v}_m \oplus \mathbf{e} \qquad (2.10.8)$$

and

$$\mathbf{v}_m \oplus \mathbf{y} = \mathbf{e} \qquad (2.10.9)$$

Also, the syndrome is given by

$$\mathbf{s} = \mathbf{y} H^T = (\mathbf{v}_m \oplus \mathbf{e}) H^T = \mathbf{e} H^T \qquad (2.10.10)$$
Now, for a given received vector y and corresponding syndrome vector s, (2.10.10)
will have $M = 2^K$ solutions $\{\mathbf{e}_m = \mathbf{y} \oplus \mathbf{v}_m\}$, one for each possible transmitted
vector. However, we have from (2.10.5) that the maximum likelihood (minimum
distance) decoder for the BSC chooses the codeword corresponding to the smal-
lest weight vector among the set {v,, ® y}. But, according to (2.10.9), for systematic
linear codes this indicates that, given the channel output y,
$$\hat{H} = H_m \qquad \text{if } w(\mathbf{e}_m) < w(\mathbf{e}_{m'}) \text{ for all } m' \ne m \qquad (2.10.11)$$
This then suggests the following mechanization of the maximum likelihood de-
coder for the BSC:
0. Initially, prior to decoding, for each of the $2^{L-K}$ possible syndromes $\mathbf{s}$ store the
minimum weight vector $\mathbf{e}(\mathbf{s})$ which satisfies (2.10.10) in a table of $2^{L-K}$ $L$-bit
entries.
1. From the L-dimensional received vector y, generate the (L — K)-dimensional
syndrome s by the linear operation (2.10.7); this requires an L-stage register
and L — K modulo-2 adders.
2. Do a table look-up in the table of step 0 to obtain $\hat{\mathbf{e}} = \mathbf{e}(\mathbf{s})$ from $\mathbf{s}$.
3. Obtain the most likely code vector by the operation
$$\hat{\mathbf{v}}_m = \mathbf{y} \oplus \hat{\mathbf{e}}$$
and the first K symbols are the data symbols according to (2.10.3a).
The complexity of this procedure lies in the table containing $2^{L-K}$ vectors of
dimension L; it follows trivially from step 3 that, because the code is systematic,
each entry can be reduced to just a K-dimensional vector; that is, it is necessary to
store only the errors which occurred in the K data symbols and not those in the
L — K parity-check symbols.
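Steps 0 through 3 can be mechanized directly, as in the following Python sketch (an added illustration, not from the text; the brute-force table construction mirrors the exponential storage noted above and is practical only for short codes).

    from itertools import product

    def build_syndrome_table(H_T, L):
        # Step 0: for each syndrome, store a minimum-weight error pattern e(s).
        # H_T is the L x (L-K) transposed parity-check matrix, given as L rows.
        n_checks = len(H_T[0])
        def syndrome(e):
            return tuple(sum(e[n] * H_T[n][j] for n in range(L)) % 2 for j in range(n_checks))
        table = {}
        for e in sorted(product([0, 1], repeat=L), key=sum):   # increasing weight
            table.setdefault(syndrome(e), e)                   # lowest-weight pattern wins
        return table, syndrome

    def decode(y, table, syndrome, K):
        # Steps 1-3: compute the syndrome, look up e(s), correct, return the data bits.
        e_hat = table[syndrome(y)]
        v_hat = [a ^ b for a, b in zip(y, e_hat)]
        return v_hat[:K]                                       # systematic: first K symbols are data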
As a direct consequence of (2.10.4), (2.10.5), and (2.10.9), it follows that a
maximum likelihood decoder for any binary code on the BSC will decode correctly
if
$$w(\mathbf{e}) < \tfrac{1}{2}\, d_{\min} \qquad (2.10.12)$$

where $d_{\min}$ is the minimum Hamming distance among all pairs of codewords.
Letting $\mathbf{y} = \mathbf{x}_{m'}$ in (2.10.5) it follows that

$$d_{\min} = \min_{m' \ne m} w(\mathbf{x}_m \oplus \mathbf{x}_{m'})$$
With the convention that ties are resolved randomly, correct decoding will
occur with some nonzero probability when (2.10.12) is an equality. Thus, when-
ever the number of errors is less than half the minimum distance between code
vectors, the decoder will be guaranteed to correct them. (However, this is not an
only if condition, unless the code vectors are sphere-packed, as will be discussed
below.) Nevertheless, (2.10.12) leads to an upper bound on error probability for
linear codes on the BSC because, as a consequence of (2.9.11), we have
$$P_E = P_{Em} \le \Pr\{ w(\mathbf{e}) \ge \tfrac{1}{2}\, d_{\min} \} \qquad (2.10.13)$$
Then, since $e_n = 1$ with probability $p$ for each $n = 1, 2, \ldots, L$, (2.10.13) is just the
binomial sum

$$P_E \le \begin{cases} \displaystyle\sum_{k=(d_{\min}+1)/2}^{L} \binom{L}{k} p^k (1 - p)^{L-k} & d_{\min} \text{ odd} \\[2ex] \displaystyle\sum_{k=d_{\min}/2}^{L} \binom{L}{k} p^k (1 - p)^{L-k} & d_{\min} \text{ even} \end{cases} \qquad (2.10.14)$$
Codes for which (2.10.14) is exact include the Hamming single-error correcting
codes which may conventionally be defined in terms of their parity-check matrix. H
is the parity-check matrix of an $(L, K)$ Hamming code if its $L$ columns ($L$ rows of
$H^T$) consist of all possible nonzero $(L - K)$-dimensional binary vectors. This implies that for a
Hamming code
$$L = 2^{L-K} - 1$$
An example of H? for a (7, 4) Hamming code is given in Fig. 2.17. Since all rows of
$H^T$ are distinct, each of the $L$ unit-weight (single) error vectors has a different
nonzero syndrome (corresponding to one row of $H^T$). There are, in fact, just
$2^{L-K} = L + 1$ distinct syndromes, one of which is the zero vector, corresponding
to no errors, and the remaining L correspond to the single-error vectors. For note,
from step 0 of the syndrome table-look-up decoder, that the minimum weight
[Figure 2.17: $H^T$ for the (7, 4) Hamming code is a $7 \times 3$ binary matrix whose seven rows are the seven distinct nonzero 3-dimensional binary vectors.]
Figure 2.17 Transpose of parity-check matrix for (7, 4) Hamming code.
error vector should be used for each syndrome. Since here all the unit-weight error
vectors correspond to all the distinct nonzero syndromes, the Hamming codes all
correct one error and only one error. This can also be verified by showing that
$d_{\min} = 3$ (see the Problems).
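The single-error-correcting property is easy to verify numerically. The Python sketch below (added for illustration; the particular row ordering of $H^T$ is an assumption, not necessarily that of Fig. 2.17) checks that the seven single-error patterns of a (7, 4) Hamming code produce seven distinct nonzero syndromes.

    # Transposed parity-check matrix H^T for a (7, 4) Hamming code in systematic
    # form [P; I_3]; the rows are the seven nonzero 3-tuples in one possible order.
    H_T = [[1, 1, 0],
           [1, 0, 1],
           [0, 1, 1],
           [1, 1, 1],
           [1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]]

    def syndrome(e):
        return tuple(sum(e[n] * H_T[n][j] for n in range(7)) % 2 for j in range(3))

    # Every single-error pattern has a distinct nonzero syndrome, so the code
    # corrects exactly one error per block.
    single_errors = [[1 if i == n else 0 for i in range(7)] for n in range(7)]
    syndromes = {syndrome(e) for e in single_errors}
    print(len(syndromes), (0, 0, 0) in syndromes)   # 7 distinct, none zero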
It is instructive to investigate the linear code generated by the H matrix of the
Hamming code (which is called its dual code)
$$G = H_{\text{Hamming}} \qquad (2.10.15)$$
This is a K x L matrix where
$$L = 2^K - 1$$
and the columns consist of all possible nonzero K-dimensional binary vectors.
Figure 2.18 shows the generator matrix of the (7, 3) code, which is the dual of the
(7, 4) Hamming code whose transposed parity-check matrix was given in
Fig. 2.17. In addition, to its right in Fig. 2.18 we adjoin the all zero column to
create an (8, 3) code. We can generalize to a $(2^K, K)$ code whose generator matrix
is the transpose of the $(2^K - 1) \times K$ matrix $H^T$ of a Hamming code augmented
by an all-zeros column, and can show that every nonzero codeword of this
augmented code has weight

$$w(\mathbf{v}_m) = L/2 = 2^{K-1} \qquad \text{for all } m \ne 1 \qquad (2.10.16)$$
For any code vector

$$\mathbf{v}_m = \mathbf{u}_m G = u_{m1}\mathbf{g}_1 \oplus u_{m2}\mathbf{g}_2 \oplus \cdots \oplus u_{mK}\mathbf{g}_K$$

where $\mathbf{g}_k$ is the $k$th row of $G$. Also, since the data symbols $u_{mk}$ are zeros and ones,
$\mathbf{v}_m$ is the modulo-2 sum of the remaining rows of $G$, after some subset of the rows
has been deleted. But we note that deletion of one row results in a matrix of
$L = 2^K$ columns of dimension $K - 1$, where each of the possible $2^{K-1}$ binary
columns appears exactly twice; similarly, deletion of $j$ rows results in a matrix of
$L = 2^K$ columns of dimension $K - j$ with each of the possible $2^{K-j}$ columns
repeated exactly $2^j$ times. But in each case, half of these $2^{K-j}$ columns contain an
odd number of ones and the other half an even number. Hence, adding all the
nondeleted rows modulo-2 is equivalent to adding all the nondeleted symbols of
the $L$ columns, half of which have even parity and the other half odd. Thus the
result is $L/2$ zeros and $L/2$ ones; hence, the desired result (2.10.16).

[Figure 2.18 Generator matrices for (7, 3) regular simplex and (8, 3) orthogonal codes: the (8, 3) code adjoins an all-zeros column to the right of the (7, 3) generator matrix.]
Equation (2.10.16) also implies, by the closure property (2.9.6), that the Ham-
ming distance between all pairs of codewords is
$$w(\mathbf{v}_m \oplus \mathbf{v}_{m'}) = L/2 = 2^{K-1} \qquad \text{for all } m' \ne m$$
Consequently, the biphase-modulated signals generated by such a code (aug-
mented by the additional all-zeros column) are all mutually orthogonal, for the
normalized inner product for any two binary signals is in general

$$\frac{1}{\mathscr{E}} \int_0^T x_m(t)\, x_{m'}(t)\, dt = \frac{1}{L}\, [L - 2 w(\mathbf{v}_m \oplus \mathbf{v}_{m'})] \qquad (2.10.17)$$
For the code under consideration, we thus have
$$\int_0^T x_m(t)\, x_{m'}(t)\, dt = 0 \qquad \text{for all } m \ne m' \qquad (2.10.18)$$
Returning to the original code generated by the $K \times (2^K - 1)$ matrix $G$ of
(2.10.15), we note that the weight of each nonzero code vector $\mathbf{v}_m$ is unchanged
when the additional all-zeros column (of Fig. 2.18) is deleted. However, the
biphase signals derived from the code are no longer orthogonal since now
$L = 2^K - 1$. From (2.10.17) we obtain

$$\frac{1}{\mathscr{E}} \int_0^T x_m(t)\, x_{m'}(t)\, dt = 1 - \frac{2 w(\mathbf{v}_m \oplus \mathbf{v}_{m'})}{L} = 1 - \frac{2^K}{2^K - 1} = -\frac{1}{2^K - 1} = -\frac{1}{L} \qquad \text{for all } m \ne m' \qquad (2.10.19)$$
This code is called a regular simplex or transorthogonal code. It is easily shown
(Prob. 2.5) that (2.10.19) corresponds to the minimum average inner product of
any equal-energy signal set. We shall discuss the relative performance of the
orthogonal and regular simplex signal sets in the next section.
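The weight property (2.10.16), and hence the orthogonality (2.10.18), can be checked numerically. The following Python sketch (an added illustration, not from the text) builds the $(2^K, K)$ generator matrix whose columns are all $K$-dimensional binary vectors and confirms that every nonzero codeword has weight $2^{K-1}$.

    from itertools import product

    def orthogonal_code(K):
        # Generator matrix of the (2^K, K) code: the columns are all 2^K binary
        # K-tuples (the all-zeros column included), per the construction above.
        cols = list(product([0, 1], repeat=K))
        return [[c[k] for c in cols] for k in range(K)]        # K x 2^K

    def codewords(G):
        K, L = len(G), len(G[0])
        for u in product([0, 1], repeat=K):
            yield [sum(u[k] * G[k][n] for k in range(K)) % 2 for n in range(L)]

    G = orthogonal_code(4)
    weights = {sum(v) for v in codewords(G) if any(v)}
    print(weights)        # {8}: every nonzero codeword has weight L/2 = 2^(K-1)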
Considerable attention has been devoted, since the earliest days of informa-
tion theory, to the study of numerous classes of linear block codes, and partic-
ularly to algebraic decoding algorithms which are of reasonable complexity and
do not require the exponentially growing storage of the syndrome table-look-up
approach which we have described. While some very elegant and reasonably
powerful linear codes and decoding techniques have been discovered, particularly
among the class of “cyclic” codes, these codes fall far short of the performance of
the best linear codes, as will be determined in the next chapter. Also, the more
readily implementable decoding algorithms, while guaranteeing the correction of
a given number of errors per block, are generally suboptimum and restricted to
hard quantized channels such as the BSC for binary codes. The last, and probably
most important, cause for the limited practical success of linear block codes is the
generally far superior capabilities of linear convolutional codes, to be discussed in
Chap. 4.
Much of the material in these last two sections can be generalized to non-bin-
ary-code alphabets, and specifically to data and code alphabets of size q, where q is
either a prime or some power of a prime. For practical storage and implementa-
tion purposes, one almost always requires q to be a power of 2. While such
generalization is straightforward, it requires the development of some elementary
concepts of finite field theory. The limited utility of the results does not seem to
warrant their inclusion here. Excellent treatments of algebraic codes over binary
as well as nonbinary alphabets are available in Berlekamp [1968], Gallager [1968],
Lin [1970], Van Lint [1971], Peterson and Weldon [1972], Blake and Mullin
[1976].
2.11 EXAMPLES OF LINEAR BLOCK CODE PERFORMANCE
ON THE AWGN CHANNEL AND ITS QUANTIZED
REDUCTIONS*
In this section, we consider briefly the performance of the two most commonly
used linear block codes for a biphase- (or quadriphase-) modulated AWGN chan-
nel, both without and with output quantization. First we consider the classes of
orthogonal and regular simplex signals. We found in Sec. 2.5 that the performance
of orthogonal signals on the AWGN channel is invariant to the particular wave-
forms used. Hence, we have the union-Bhattacharyya bound (2.3.19) or the
tighter Gallager bound (2.5.12) with $M = 2^K$ and $\mathscr{E} = 2^K \mathscr{E}_s$. One can also readily
obtain the exact expression (see Prob. 2.4) which is
$$P_E = 1 - \int_{-\infty}^{\infty} \frac{e^{-x^2/2}}{\sqrt{2\pi}} \left[ 1 - Q\!\left( x + \sqrt{\frac{2\mathscr{E}}{N_0}} \right) \right]^{M-1} dx \qquad (2.11.1)$$
This integral has been tabulated for M = 2* for all K up to 10 (see Viterbi [1966]).
It is plotted in Fig. 2.19, for K = 6, as a function of &,/N, where &, is the energy
per transmitted bit, which is related to & and K by the relation
$$\mathscr{E}_b = \mathscr{E}/K \qquad (2.11.2)$$
* May be omitted without loss of continuity.
[Figure 2.19 shows, as functions of $\mathscr{E}_b/N_0$ in dB, the upper bounds and exact error probabilities for $2^6$ orthogonal signals and for the Golay (24, 12) code, each with two-level quantization and with no quantization.]
Figure 2.19 Error probability for $2^6$ orthogonal and Golay (24, 12) coded signals on the AWGN channel.
The regular simplex signal set performs exactly as well as the orthogonal signal set
for, as is evident from Fig. 2.18, one symbol or dimension is identical for all signals
in the set; hence, it might as well not be transmitted for it does not assist at all in
discrimination between signals. However, in so dropping the rightmost symbol
from the orthogonal code to obtain the regular simplex code, we are actually
reducing the energy per transmitted bit to

$$\mathscr{E}_b' = \mathscr{E}_b \frac{2^K - 1}{2^K} = \mathscr{E}_b (1 - 2^{-K})$$

This means that the error probability curve as a function of $\mathscr{E}_b/N_0$ of Fig. 2.19 is
actually translated to the left by an amount $10 \log_{10} (1 - 2^{-K})$ dB which for
K = 6 is approximately 0.07 dB. For comparison purposes, the union bound for
orthogonal codes, obtained from (2.3.4) and (2.3.10), is also shown.
Now let us consider the limiting case of two-level (hard) quantization so that
the AWGN channel is reduced to the BSC. In this case, we have the general bound
(2.10.14). For orthogonal codes, however, this bound is very weak. For, while
$d_{\min} = 2^{K-1} = L/2$, it is possible to decode correctly in many cases where the
number of errors is greater than $L/4$ because of the sparseness of the codewords in
the $2^K$-dimensional space. In fact, the bound (2.10.14) becomes increasingly poor
as K increases (see Prob. 2.12). On the other hand, we may proceed to bound the
BSC performance by using the union bound (2.3.4), resolving ties randomly
$$P_E \le \sum_{m'=2}^{2^K} \Bigl[ \Pr\{ w(\mathbf{y} \oplus \mathbf{x}_{m'}) < w(\mathbf{y} \oplus \mathbf{x}_1) \mid \mathbf{x}_1 \} + \tfrac{1}{2} \Pr\{ w(\mathbf{y} \oplus \mathbf{x}_{m'}) = w(\mathbf{y} \oplus \mathbf{x}_1) \mid \mathbf{x}_1 \} \Bigr]$$
$$= (2^K - 1)\bigl[ \Pr\{\text{more than } 2^{K-2} \text{ errors in } 2^{K-1} \text{ positions}\} + \tfrac{1}{2} \Pr\{2^{K-2} \text{ errors in } 2^{K-1} \text{ positions}\} \bigr]$$
$$= (2^K - 1)\left[ \sum_{k=2^{K-2}+1}^{2^{K-1}} \binom{2^{K-1}}{k} p^k (1 - p)^{2^{K-1}-k} + \frac{1}{2}\binom{2^{K-1}}{2^{K-2}} p^{2^{K-2}} (1 - p)^{2^{K-2}} \right] \qquad (2.11.3)$$

where $p = Q(\sqrt{2\mathscr{E}_s/N_0})$. This result is also plotted for K = 6 in Fig. 2.19. Again,
the performance for regular simplex codes is the same but the transmitted energy
is slightly less.
Probably the most famous, and possibly the most useful, linear block codes
are the Golay (23, 12) and (24, 12) codes, which have minimum distances equal to 7
and 8 respectively. The former is called a perfect code which means that all spheres
of Hamming radius $r$ around each code vector $\mathbf{v}_m$ (i.e., the sets of all vectors at
Hamming distance r from the code vector) are disjoint and every vector y is at
most a distance $r$ from some code vector $\mathbf{v}_m$. The only nontrivial¹⁶ perfect binary
codes are the Hamming codes with r = 1, and the Golay (23, 12) code with r = 3.
The (24, 12) code is only quasi-perfect, meaning that all spheres of radius r about
each code vector are disjoint, but that every vector y is at most at distance r + 1
¹⁶ Two codewords of odd length that differ in every position form a perfect code, and there are many
perfect codes with $d_{\min} = 1$.
from some code vector v,,. Here again r = 3. It is easy to show that perfect and
quasi-perfect codes achieve the minimum error probability for the given values of
(L, K). This second code is actually used more often than the first for various
reasons including its slightly better performance on the AWGN channel. The
Golay codes are among the few linear codes, besides the Hamming and ortho-
gonal classes, for which all the code vector weights are known. These are sum-
marized in Table 2.2. While an exact expression for P; on the AWGN channel is
not obtainable in closed form, given all the code vector weights, we may apply the
union bound of (2.9.17) and thus obtain
$$P_E \le \sum_{w \in W} N_w\, Q\!\left( \sqrt{\frac{2\mathscr{E}_s}{N_0}\, w} \right) \qquad (2.11.4)$$
where the index set W and the integer N,, are given in Table 2.2. This result is also
plotted in Fig. 2.19 and, although it is only a bound, it is reasonably tight as
verified by simulation.
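Using the (24, 12) weights of Table 2.2, the bound (2.11.4) is a one-line computation. The Python sketch below is an added illustration (the $Q$ function is implemented via the complementary error function, and the 5 dB operating point is an arbitrary example).

    import math

    def Q(x):
        # Gaussian tail probability Q(x) via the complementary error function
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    # Nonzero-codeword weight distribution of the Golay (24, 12) code (Table 2.2)
    golay24_weights = {8: 759, 12: 2576, 16: 759, 24: 1}

    def union_bound_awgn(weights, es_over_n0):
        # Union bound (2.11.4): P_E <= sum_w N_w Q(sqrt(2 (E_s/N_0) w))
        return sum(N_w * Q(math.sqrt(2.0 * es_over_n0 * w)) for w, N_w in weights.items())

    # Example: E_b/N_0 = 5 dB; for this rate-1/2 code, E_s = E_b / 2
    eb_over_n0 = 10 ** (5.0 / 10.0)
    print(union_bound_awgn(golay24_weights, eb_over_n0 / 2.0))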
On the BSC, for the (24, 12) code, minimum distance decoding always cor-
rects 3 or fewer errors and corrects one-sixth of the weight 4 error vectors. On the
other hand, error vectors of weight 5 or more are never corrected, since by the
quasi-perfect property, there exists some code vector at a distance no greater
than 4 from every received vector y. Similarly for the (23, 12) code all
error vectors of weight 3 or less, and only these, are corrected. Hence for the
(23, 12) code the expression (2.10.14) holds exactly. For the (24, 12) code we can
multiply the first term in (2.10.14) by 5/6 and also obtain an exact result. This
result for $p = Q(\sqrt{2\mathscr{E}_s/N_0})$, $L = 24$, $d_{\min} = 8$, $\mathscr{E}_b = 2\mathscr{E}_s$, is plotted in Fig. 2.19.
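The exact BSC result just described is equally simple to compute: (2.10.14) with $d_{\min} = 8$, $L = 24$, and the first term weighted by 5/6. A Python sketch (added for illustration only) follows.

    import math

    def Q(x):
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    def golay24_bsc_block_error(es_over_n0, L=24, dmin=8):
        # Exact P_E of the quasi-perfect (24, 12) Golay code on the hard-quantized
        # channel: all error patterns of weight > 4 cause errors, and 5/6 of the
        # weight-4 patterns do.
        p = Q(math.sqrt(2.0 * es_over_n0))
        def term(k):
            return math.comb(L, k) * p**k * (1.0 - p)**(L - k)
        return (5.0 / 6.0) * term(dmin // 2) + sum(term(k) for k in range(dmin // 2 + 1, L + 1))

    print(golay24_bsc_block_error(10 ** (5.0 / 10.0) / 2.0))   # E_b/N_0 = 5 dB, E_s = E_b/2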
A potentially disturbing feature of the above results is that in each case we
have determined the block error probability. But for orthogonal and regular sim-
plex codes, we have used K = 6 bits/block while for the Golay code we have
K = 12 bits/block, and we would expect that the block error probability might be
Table 2.2 Weight of code vectors in Golay codes (Peterson [1961])

                    Number of code vectors of weight w, N_w
    Weight, w       (23, 12) code       (24, 12) code
    0                     1                   1
    7                   253                   0
    8                   506                 759
    11                 1288                   0
    12                 1288                2576
    15                  506                   0
    16                  253                 759
    23                    1                   0
    24                    0                   1
    Total              4096                4096
influenced by the number of bits transmitted by the block code. We may define bit
error probability P, as the expected number of information bit errors per block
divided by the total number of information bits transmitted per block. For orthog-
onal and regular simplex codes, all block errors are equiprobable since all $2^K$
code vectors are mutually equidistant. Thus, since there are $\binom{K}{k}$ ways in which $k$
out of $K$ bits may be in error and since an error will cause any pattern of errors in
the data vector with equal probability $P_E/(2^K - 1)$, it follows that for orthogonal
and regular simplex codes over any of the channels considered

$$P_b = \frac{1}{2^K - 1} \sum_{k=1}^{K} \frac{k}{K} \binom{K}{k} P_E = \frac{2^{K-1}}{2^K - 1}\, P_E$$

which, for all but very small $K$, is very nearly

$$P_b \approx P_E/2$$
The evaluation of P, is not nearly as simple and elegant for other linear block
codes and in fact depends on the particular generator matrix chosen. However, for
the Golay (24, 12) code with a systematic encoder, we may argue approximately
as follows. Block errors will usually (with high probability) cause a choice of an
incorrect code vector which is at distance 8 from the correct code vector. This
means that one-third of all code symbols are usually in error when a block error is
made. But since the code is systematic and half the code symbols are data symbols,
the same ratio occurs among the data symbols. Hence, it follows that approxi-
mately, $P_b \approx P_E/3$. In general, in any case, we have trivially $P_b \le P_E$ and also the
lower bound $P_E/K \le P_b$. Hence the upper bounds on $P_E$ are also valid for $P_b$, and
the comparison of $P_b$ for two codes is nearly as useful as that of $P_E$ even when the
block lengths are different.
Comparison in Fig. 2.19 of the performance of each code on the AWGN
channel and on its hard quantized reduction, the BSC, indicates that hard quanti-
zation causes a degradation of very nearly 2 dB. This result is best explained by
using the union-Bhattacharyya bound (2.9.19). By this bound

$$P_E \le \sum_{k=2}^{M} e^{-w_k d} \qquad (2.11.5)$$

where

$$d = -\ln \sum_{y} \sqrt{p_0(y)\, p_1(y)}$$

is a function of the quantization procedure, while $w_2, w_3, \ldots, w_M$, the weights of
the nonzero codewords, are invariant to quantization. As also demonstrated by
(2.3.17), for the AWGN¹⁷ channel

$$d = -\ln \int_{-\infty}^{\infty} \sqrt{p_0(y)\, p_0(-y)}\, dy = -\ln \int_{-\infty}^{\infty} \exp\left[ -\frac{1}{4}\left( y - \sqrt{\frac{2\mathscr{E}_s}{N_0}} \right)^2 - \frac{1}{4}\left( y + \sqrt{\frac{2\mathscr{E}_s}{N_0}} \right)^2 \right] \frac{dy}{\sqrt{2\pi}} = \mathscr{E}_s/N_0 \qquad (2.11.6)$$
For the BSC on the other hand, we have shown in Sec. 2.9 that
$$d = -\ln \sqrt{4p(1 - p)} \qquad (2.11.7a)$$
where
$$p = Q(\sqrt{2\mathscr{E}_s/N_0}) \qquad (2.11.7b)$$
But in the case of orthogonal codes

$$\mathscr{E}_s/N_0 = 2^{-K} K\, \mathscr{E}_b/N_0$$

which is extremely small when $K \gg 1$. Similarly, for any code in which
$L \gg K \mathscr{E}_b/N_0$

$$\frac{\mathscr{E}_s}{N_0} = \frac{K}{L}\, \frac{\mathscr{E}_b}{N_0} \ll 1$$

In such cases (2.11.7b) approaches

$$p \approx \frac{1}{2} - \sqrt{\frac{\mathscr{E}_s}{\pi N_0}} \qquad (2.11.8)$$

Thus for the BSC with $\mathscr{E}_s/N_0 \ll 1$ (or, almost equivalently, $L \gg K$)

$$d \approx \frac{2}{\pi}\, \frac{\mathscr{E}_s}{N_0} \qquad (2.11.9)$$
Comparing (2.11.6) and (2.11.9), we see that in order to obtain the same bound
(2.11.5), we must increase the energy by a factor $\pi/2$ (2 dB) for the BSC relative to
the AWGN channel. Even though (2.11.9) has been shown under the condition
that $\mathscr{E}_s/N_0 \ll 1$, the approximate 2 dB degradation for two-level quantization seems
empirically to hold even when this condition is not met (see, for example,
Fig. 2.19). Cases of intermediate quantization are also readily evaluated (see
Prob. 2.13) and the resulting d is easily computed. Other measures of quantization
loss will also be considered in the next chapter.
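The approximate 2 dB figure can be reproduced by computing the two Bhattacharyya distances directly. The Python sketch below (an added illustration, not from the text) evaluates $d$ for the unquantized AWGN channel and for its hard-quantized BSC reduction and prints the equivalent energy loss in dB.

    import math

    def Q(x):
        return 0.5 * math.erfc(x / math.sqrt(2.0))

    def d_awgn(es_over_n0):
        # Bhattacharyya distance of the unquantized AWGN channel, (2.11.6)
        return es_over_n0

    def d_bsc(es_over_n0):
        # Bhattacharyya distance after hard quantization to a BSC, (2.11.7)
        p = Q(math.sqrt(2.0 * es_over_n0))
        return -math.log(math.sqrt(4.0 * p * (1.0 - p)))

    for snr_db in (-6, -3, 0, 3):
        snr = 10 ** (snr_db / 10.0)
        loss_db = 10.0 * math.log10(d_awgn(snr) / d_bsc(snr))
        print(snr_db, round(loss_db, 2))   # approaches 10 log10(pi/2), about 2 dB, at low SNR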
¹⁷ For $p_0(y)$, the random variable $y$ can be taken to have mean $\sqrt{\mathscr{E}_s}$ and variance $N_0/2$, or
equivalently we may normalize it to have mean $\sqrt{2\mathscr{E}_s/N_0}$ and unit variance. The latter is used in
(2.11.6).
Upon initially defining linear block codes in Sec. 2.9, we showed that they
could be used in conjunction with multiple-amplitude and multiple-phase modu-
lation by using several code symbols to select each signal symbol (or dimension).
However, we have given no examples of performance of such signal sets. One
reason is that the uniform error property $P_{Em} = P_E$ does not generally hold for
such cases, making the analysis of a particular code much more complex; another
is that the results are much less revealing. On the other hand, in the next chapter
we shall develop the technique of ensemble performance evaluation, which is no
more difficult for these cases of nonbinary modulation than for biphase (or
quadriphase) modulation.
2.12 OTHER MEMORYLESS CHANNELS
Thus far we have concentrated exclusively on the AWGN channel and its
quantized reductions, all of which are memoryless channels. These channel
models apply most accurately to line-of-sight space and satellite communication.
As a result, since such channels have become commonplace, coding to improve
error performance in digital communication has been most prevalent in these
applications.
2.12.1 Colored Noise
Yet even with the AWGN channel, certain imperfections invariably enter to
degrade performance, some of which were discussed in Sec. 2.6. For example,
intersymbol interference is caused by linear filtering in the transmitter, channel, or
receiver when the “ predetection ” filters are not sufficiently wideband for the given
signal. But receiver filtering also modifies the noise spectral density so that the
white noise model is no longer appropriate. The resulting zero-mean noise with
nonuniform spectral density is called colored. It can be treated in either of two
ways. The rigorous theoretical approach is to expand the noise process in a
Karhunen-Loéve series
$$n(t) = \lim_{N \to \infty} \sum_{n=1}^{N} n_n\, \phi_n(t)$$
where the {@,(t)} are normalized eigenfunctions of the noise covariance function
and the {n,} are independent Gaussian variables with zero means and variances
equal to the eigenvalues of the noise covariance function (Helstrom [1968], Van
Trees [1968]). In particular, if the noise covariance function is positive
definite, the eigenfunctions form a complete basis for finite-energy functions so
that the signals {x,,(t)} can also be represented in terms of their projections on the
basis $\{\phi_n(t)\}$. We then have the representation

$$x_m(t) = \lim_{N \to \infty} \sum_{n=1}^{N} x_{mn}\, \phi_n(t)$$
where
$$x_{mn} = \int_0^T x_m(t)\, \phi_n(t)\, dt \qquad n = 1, 2, \ldots, N;\ m = 1, 2, \ldots, M$$
and the channel can be represented as an infinite-dimensional additive vector
channel
$$\mathbf{y} = \mathbf{x}_m + \mathbf{n} \qquad \text{when } H_m \text{ is the transmitted message}$$
wherein the individual variances of the noise components differ from dimension to
dimension. One can then conceive of coding the signal projections {x,,,,} for this
channel model which is memoryless, but not constant since the noise variance
varies from dimension to dimension. Such a development has been carried out by
Gallager [1968] who obtained the code ensemble average error probability under
a constraint on the signal energy. However, no practical channel could be rea-
sonably encoded in this way.
An alternative and more direct, though less rigorous, approach to colored
noise, proposed by Bode and Shannon [1950] (see also Wozencraft and Jacobs
[1965], Chap. 7) is to “whiten” the noise by passing the received process through
a whitening filter, the squared magnitude of whose transfer function is the inverse
of the noise spectral density. While this also distorts the signal, it does so in a
known manner so that the result is a known, though distorted, signal set in white
Gaussian noise which can be treated as before. The weakness of this approach is
that it ignores boundary effects for finite-time signals and is hence somewhat
imprecise unless the signal symbol durations are long compared to the inverse
noise bandwidth. Probably the best approach to this problem is to guarantee that
the receiver predetection bandwidth is sufficiently wide, compared to the inverse
symbol time, and that the noise spectral density is uniform in the frequency region
of interest, so that the white noise model can be applied with reasonable accuracy.
2.12.2 Noncoherent Reception
Another degrading feature, noted briefly in Sec. 2.6, is that of imperfectly known
carrier phase, as well as imperfectly known carrier frequency and symbol time.
While the latter two parameters must always be estimated with reasonable accur-
acy, for any digital communication system will degrade intolerably otherwise, it is
possible to operate without knowledge of the phase. Referring to Table 2.1 in
Sec. 2.6 and to Fig. 2.9, we suppose that we have only two frequency-orthogonal
signals whose frequency separation is a multiple of $2\pi/T$ radians per second. Note
that this is the separation required for quadrature-phase frequency-orthogonal
functions; the same separation is necessary when the phase is unknown, for in this
event the sine and cosine functions will be indistinguishable upon reception. Thus
we have
$$x_m(t) = \sqrt{2\mathscr{E}}\, f(t) \sin(\omega_m t + \phi) \qquad m = 1, 2 \qquad (2.12.1)$$

where $f(t)$ is a known envelope function of unit norm, $\omega_m$ is some multiple of
[Figure 2.20: the received $y(t)$ is correlated over $(0, T)$ with $\sqrt{2} f(t) \sin \omega_1 t$, $\sqrt{2} f(t) \cos \omega_1 t$, $\sqrt{2} f(t) \sin \omega_2 t$, and $\sqrt{2} f(t) \cos \omega_2 t$; each pair of correlator outputs is squared and summed to form $y_1^2$ and $y_2^2$, which drive a maximum detector.]
Figure 2.20 Optimum demodulator for noncoherent reception.
$2\pi/T$, $\omega_1 \ne \omega_2$, and $\phi$ may be taken as a random variable uniformly
distributed on the interval 0 to $2\pi$. This is generally called noncoherent reception.
It is clear that the optimum demodulator (Fig. 2.20) consists of two devices each
equivalent to those required by a quadrature-phase signal. When signal x,(t) is
sent, the set of four observables is

$$y_{1s} = \sqrt{\mathscr{E}} \cos\phi + n_{1s}$$
$$y_{1c} = \sqrt{\mathscr{E}} \sin\phi + n_{1c}$$
$$y_{2s} = n_{2s}$$
$$y_{2c} = n_{2c} \qquad (2.12.2)$$

where

$$n_{ms} = \sqrt{2} \int_0^T n(t)\, f(t) \sin \omega_m t\, dt \qquad n_{mc} = \sqrt{2} \int_0^T n(t)\, f(t) \cos \omega_m t\, dt$$

all four of which are mutually independent with zero mean and variance $N_0/2$.
The likelihood function, when message | is sent and the phase is @, is therefore
$$p_4(y_{1s}, y_{1c}, y_{2s}, y_{2c} \mid \mathbf{x}_1, \phi) = \frac{\exp\{ -[(y_{1s} - \sqrt{\mathscr{E}} \cos\phi)^2 + (y_{1c} - \sqrt{\mathscr{E}} \sin\phi)^2 + y_{2s}^2 + y_{2c}^2]/N_0 \}}{(\pi N_0)^2} \qquad (2.12.3)$$
But $\phi$ is a uniformly distributed random variable and thus the likelihood function
of the observables $\mathbf{y}$, given message 1, is just (2.12.3) averaged over $\phi$, namely

$$p_4(y_{1s}, y_{1c}, y_{2s}, y_{2c} \mid \mathbf{x}_1) = p_2(y_{2s}, y_{2c})\, \frac{1}{2\pi} \int_0^{2\pi} p_2(y_{1s}, y_{1c} \mid \mathbf{x}_1, \phi)\, d\phi$$
$$= \frac{e^{-y_2^2/2}\, e^{-\mathscr{E}/N_0 - y_1^2/2}}{(\pi N_0)^2}\; I_0\!\left( \sqrt{\frac{2\mathscr{E}}{N_0}}\, y_1 \right) \qquad (2.12.4)$$

where

$$y_m^2 = \frac{2}{N_0}\, (y_{mc}^2 + y_{ms}^2) \qquad (2.12.5)$$

and where

$$I_0(x) = \frac{1}{2\pi} \int_0^{2\pi} e^{x \cos\theta}\, d\theta$$

is the zeroth order modified Bessel function which is a monotonically increasing
function of $x$. By symmetry, it is clear that $p_4(\mathbf{y} \mid \mathbf{x}_2)$ is the same as $p_4(\mathbf{y} \mid \mathbf{x}_1)$ with
the subscripts 1 and 2 interchanged throughout. Thus the decision rule for two
messages is, according to (2.2.7)

$$\hat{H} = H_1 \qquad \text{if } \ln p_4(\mathbf{y} \mid \mathbf{x}_1) > \ln p_4(\mathbf{y} \mid \mathbf{x}_2)$$

or in this case

$$\hat{H} = H_1 \qquad \text{if } \ln I_0\!\left( \sqrt{\frac{2\mathscr{E}}{N_0}}\, y_1 \right) > \ln I_0\!\left( \sqrt{\frac{2\mathscr{E}}{N_0}}\, y_2 \right)$$

Since $I_0$ is a monotonically increasing function of its argument, this is equivalent
to

$$\hat{H} = H_1 \qquad \text{if } y_1^2 \ge y_2^2 \qquad (2.12.6)$$
Thus the decision depends only on the sum of the squares of the observables, $y_1^2$
and $y_2^2$, of each demodulator for each signal (or any monotonic finite function
thereof) whose generation from $y_{1s}$, $y_{1c}$, $y_{2s}$, and $y_{2c}$ is as shown in Fig. 2.20.
Henceforth then, we may consider y, and y, to be the observables. It follows from
(2.12.4) that these observables are independent, i.e., that
$$p_2(y_1, y_2 \mid \mathbf{x}_1) = p(y_1 \mid \mathbf{x}_1)\, p(y_2 \mid \mathbf{x}_1)$$
Also from the definition (2.12.5) which is equivalent to a Cartesian-to-polar coor-
dinate transformation, and from the result of (2.12.4), it follows that
$$p(y_1 \mid \mathbf{x}_1) = \pi N_0\, y_1\, p_2(y_{1c}, y_{1s} \mid \mathbf{x}_1) = y_1\, e^{-(y_1^2/2 + \mathscr{E}/N_0)}\, I_0\!\left( \sqrt{\frac{2\mathscr{E}}{N_0}}\, y_1 \right)$$
$$p(y_2 \mid \mathbf{x}_1) = \pi N_0\, y_2\, p_2(y_{2c}, y_{2s} \mid \mathbf{x}_1) = y_2\, e^{-y_2^2/2} \qquad (2.12.7)$$
It is then relatively simple to obtain the error probability for noncoherent
demodulation of two frequency-orthogonal signals. For
$$P_{E_1} = \Pr\{ y_2 > y_1 \mid \mathbf{x}_1 \} = \int_0^\infty p(y_1 \mid \mathbf{x}_1) \int_{y_1}^\infty p(y_2 \mid \mathbf{x}_1)\, dy_2\, dy_1$$
$$= \int_0^\infty y_1\, e^{-(y_1^2/2 + \mathscr{E}/N_0)}\, I_0\!\left( \sqrt{\frac{2\mathscr{E}}{N_0}}\, y_1 \right) e^{-y_1^2/2}\, dy_1$$
$$= \tfrac{1}{2}\, e^{-\mathscr{E}/2N_0} \qquad (2.12.8)$$

By symmetry,

$$P_{E_2} = P_{E_1} = P_E = \tfrac{1}{2}\, e^{-\mathscr{E}/2N_0} \qquad (2.12.9)$$
Generalization to M frequency-orthogonal signals of the form (2.12.1) is
completely straightforward. The demodulator becomes a bank of devices of the
type of Fig. 2.20. The error probability can be obtained as an (M — 1)-term sum-
mation of exponentials (Prob. 2.14) and an asymptotically tight upper bound can
be derived which is identical to that for coherent (known phase) reception of M
orthogonal signals, given by (2.5.16) (see Prob. 2.15). This result does not imply,
however, that ignorance of phase does not in general degrade performance. The
fact that the performance of noncoherent reception of M orthogonal signals is
asymptotically the same as for coherent reception is explained by noting that, as
M becomes larger, so does T, and consequently the optimum receiver effectively
estimates the phase over a long period T in the process of deciding among the M
possible signals. As an example of the opposite extreme, consider a binary coded
system of the type treated in the previous section where each binary symbol is
transmitted as one of two frequency-orthogonal signals (2.12.1) which are demod-
ulated symbol by symbol, resulting in a BSC with transition probability p given
by (2.12.9) with $\mathscr{E} = \mathscr{E}_s$. Now when $\mathscr{E}_s/N_0 \ll 1$, the union-Bhattacharyya error
bound for such a coded system is the same as (2.11.5) but with the Bhattacharyya
distance given by
$$d = -\ln \sqrt{4p(1 - p)} = -\ln \sqrt{2\, e^{-\mathscr{E}_s/2N_0} (1 - \tfrac{1}{2} e^{-\mathscr{E}_s/2N_0})}$$
$$\approx -\ln \sqrt{(1 - \mathscr{E}_s/2N_0)(1 + \mathscr{E}_s/2N_0)} \approx \frac{1}{8} \left( \frac{\mathscr{E}_s}{N_0} \right)^2 \qquad (2.12.10)$$
This is clearly a great degradation relative to the coherent case for which
$d \approx (2/\pi)\, \mathscr{E}_s/N_0$ when $\mathscr{E}_s/N_0 \ll 1$. One would suspect initially that a cause of this
degradation is that the distance between signals for each symbol is reduced by a
factor of 2 by the use of orthogonal signals compared to biphase signals, for the
latter are opposite in sign and consequently have $\|\mathbf{s}_1 - \mathbf{s}_2\|^2 = 4\mathscr{E}_s$. There is in
fact a technique applicable to noncoherent reception, called differential phase shift
keying (see, e.g., Viterbi [1966], Van Trees [1968]) which effectively doubles the
energy per symbol and produces the error probability of (2.12.9) with energy
doubled. But this is clearly not a sufficient explanation because even if we used
double the energy in the noncoherent case, we would merely multiply (2.12.10)
by a factor of 4 and this would still be a negligibly small d compared to the
coherent case when &,/N, < 1. The situation is somewhat improved with opti-
mum unquantized decoding, but there is still significant degradation.
There is in fact no justification in a coded system for not measuring the phase
accurately enough to avoid this major degradation, provided, of course, that the
phase varies very slowly relative to the code block length, as assumed here. When
the phase varies rapidly, this is usually accompanied by rapidly varying ampli-
tude, and the channel may be characterized as a fading-dispersive medium, the
case which we consider next.
2.12.3 Fading-Dispersive Channels
A more serious source of degradation, prevalent in over-the-horizon propagation
such as high-frequency ionospheric reflection and tropospheric scatter communi-
cation, is the presence of amplitude fading as well as rapid phase variations. The
model of this phenomenon is usually taken to be a large number of diffuse scat-
terers or reflectors which move randomly relative to one another, causing the
signal to arrive at the receiver as a linear combination of many replicas of
the original signal, each attenuated and phase shifted by random amounts. By
the central limit theorem, the distribution of the sum of many independent
random variables approaches the Gaussian distribution. Hence a sinusoidal signal
$\sqrt{2\mathscr{E}}\, f(t) \sin \omega_m t$ will arrive at the receiver as

$$y(t) = \sqrt{2\mathscr{E}}\, f(t)[a(t) \sin \omega_m t + b(t) \cos \omega_m t] + n(t) \qquad 0 \le t \le T \qquad (2.12.11)$$
where a(t) and b(t) are independent zero-mean Gaussian processes, with given
covariance functions, and where n(t) is AWGN of thermal origin present in the
observation. While we might consider more general signal sets, it should be clear
that, in view of the random amplitude and phase perturbation by the channel,
signals can be distinguished only by frequency. Each received signal, aside from
the additive noise n(t), is a Gaussian random process with bandwidth dictated by
the propagation medium and determined from the spectral densities of a(t) and
b(t). If the frequencies w,, are spaced sufficiently far apart compared to their
bandwidth, the signal random processes will have essentially nonoverlapping
spectra and the problem reduces to that of detecting one of M “orthogonal”
random processes. Once the observable statistics have been established, the prob-
lem is very similar to that of M orthogonal deterministic signals treated in
Sec. 2.5, except that the decoding involves quadratic rather than linear operations
on the observables (Helstrom [1968], Kennedy [1969], Viterbi [1967c]).
A more realistic model, less wasteful of bandwidth, more amenable to coding,
and more representative of practical systems, results from assuming that over
short subintervals of T/N seconds the random signal is essentially constant. Then,
assuming signal pulses of duration T/N during a given nth subinterval, we have
the received signal
$$y(t) = \sqrt{2\mathscr{E}_s}\, f(t - nT/N)[a \sin \omega_m t + b \cos \omega_m t] + n(t) \qquad (n-1)T/N \le t \le nT/N,\ m = 1, 2 \qquad (2.12.12)$$

where $a$ and $b$ are zero-mean independent Gaussian variables with variance $\sigma^2$, $\omega_m$
is a multiple of $2\pi N/T$, $\mathscr{E}_s = \mathscr{E}/N$ is the symbol energy, and $f(t)$ with unit norm is
as defined in (2.6.5). Defining

$$r = \sqrt{a^2 + b^2} \qquad \phi = \tan^{-1}(b/a) \qquad (2.12.13)$$

we may rewrite (2.12.12) as

$$y(t) = \sqrt{2\mathscr{E}_s}\, r f(t - nT/N) \sin(\omega_m t + \phi) + n(t) \qquad (n-1)T/N \le t \le nT/N,\ m = 1, 2 \qquad (2.12.14)$$
The statistics of $r$ and $\phi$ are easily obtained from those of $a$ and $b$ by the
transformation¹⁸ $a = r\cos\phi$, $b = r\sin\phi$:

$$p(r, \phi) = \frac{r}{2\pi\sigma^2}\, e^{-r^2/2\sigma^2} = p(\phi)\, p(r) \qquad 0 \le \phi < 2\pi,\ r \ge 0 \qquad (2.12.15)$$
term Rayleigh fading.
We shall limit attention primarily to a binary input alphabet (M = 2) based
on two frequency-orthogonal signals, although generalization to a larger set of
frequencies is straightforward. Comparing (2.12.14) with (2.12.1), we note that the
only difference is the random amplitude in the former. But since the quadrature
demodulator of Fig. 2.20 is optimum for a uniformly distributed random phase @
and any amplitude, it is clear that the fact that the amplitude is a random variable
is immaterial. Assuming for the moment that we are merely interested in one
symbol (or alternatively that the random variables r and ¢, or a and b}, are
constant over the entire T seconds), we may readily evaluate the error probability
for the Rayleigh fading binary frequency-orthogonal signals from that for non-
coherent detection of fixed amplitude signals. For, if r were known exactly, using
the optimum demodulator of Fig. 2.20,¹⁹ we would have error probability for
noncoherent reception of (2.12.9) with $\mathscr{E}$ replaced by $\mathscr{E}_s r^2$. Hence

$$P_E(r) = \tfrac{1}{2}\, e^{-r^2 \mathscr{E}_s/2N_0}$$

Now since $r$ is a random variable whose distribution is given by the second factor
of (2.12.15), we see that the symbol error probability with Rayleigh fading is

$$P_E = \int_0^\infty p(r)\, P_E(r)\, dr = \int_0^\infty \frac{r}{\sigma^2}\, e^{-r^2/2\sigma^2}\, \tfrac{1}{2}\, e^{-r^2\mathscr{E}_s/2N_0}\, dr = \frac{1}{2(1 + \sigma^2 \mathscr{E}_s/N_0)} = \frac{1}{2 + \bar{\mathscr{E}}_s/N_0} \qquad (2.12.16)$$
¹⁸ For the rectangular-to-polar transformation used here, the Jacobian is $J = r$.
¹⁹ The demodulator integrates over $T/N$ second intervals here rather than $T$ seconds as shown in
Fig. 2.20.
where we have denoted the average received energy per symbol by
$$\bar{\mathscr{E}}_s \triangleq \mathscr{E}_s \int_0^\infty r^2 p(r)\, dr = 2\sigma^2 \mathscr{E}_s \qquad (2.12.17)$$
It is quite interesting to note that while phase randomness does not destroy the
exponential dependence of P, on energy-to-noise ratio, amplitude randomness
does change it into the much weaker inverse linear dependence.
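The contrast between the exponential dependence of (2.12.9) and the inverse-linear dependence of (2.12.16) is apparent from a direct computation, as in the Python sketch below (added for illustration; the SNR values are arbitrary examples).

    import math

    def p_noncoherent(es_over_n0):
        # Noncoherent, no fading: P_E = (1/2) exp(-E_s/2N_0), from (2.12.9)
        return 0.5 * math.exp(-es_over_n0 / 2.0)

    def p_rayleigh(avg_es_over_n0):
        # Noncoherent, Rayleigh fading: P_E = 1/(2 + average E_s/N_0), from (2.12.16)
        return 1.0 / (2.0 + avg_es_over_n0)

    for snr_db in (10, 20, 30):
        snr = 10 ** (snr_db / 10.0)
        print(snr_db, p_noncoherent(snr), p_rayleigh(snr))
    # The unfaded error probability drops exponentially with SNR, while the
    # Rayleigh-faded error probability drops only as 1/SNR.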
Let us now consider the demodulation and decoding of multidimensional, or
multiple symbol, coded Rayleigh fading signals. The most common form of coding
for Rayleigh fading is the trivial repetitive code, using the same signal for all N
dimensions, which is generally called diversity transmission. Before proceeding
with the analysis even in this case, we must impose a fundamental assumption on
the communication system: namely, that the random channel amplitude and
phase variables are independent from symbol to symbol. Several techniques are
commonly used to achieve this independence. First, different pairs of frequencies
can be used for successive symbols. If the frequency pair for one symbol is widely
separated from that of the next few symbols, the necessary independence can
usually be acquired, but at the cost of greatly expanded bandwidth. Another
approach, space diversity, actually transmits a single symbol, but uses N antennas
sufficiently separated spatially that the random phases and amplitudes are in-
dependent of one another; then the N observables consist of a combination of N
single observables from each antenna-receiver. Of course, spatial diversity cor-
responds only to the case of trivial repetitive coding. When nontrivial coding is
used, particularly when bandwidth must be conserved, a third approach called
time-diversity is commonly employed. This technique achieves the independence
by spacing successive symbols of a given codeword at wide intervals in time,
placing in between similarly spaced symbols of other codewords. This technique,
illustrated in Fig. 2.21 and discussed further below, is generally called interleaving.
Given the independence among symbols, we can consider an N-dimensional
signal where each dimension consists of the transmission of one of two binary
frequency-orthogonal signals. We then have from the demodulator of Fig. 2.20
(with integration over $T/N$ second intervals) the $2N$ observables $(y_1, y_2, \ldots, y_N) = (y_{11} y_{21}, y_{12} y_{22}, \ldots, y_{1N} y_{2N})$, consisting of $N$ pairs of observations (where $y_{1n}$,
$y_{2n}$ is the pair of observables for the $n$th symbol), for the two possible transmitted
frequencies $\omega_1$ and $\omega_2$.
Again for a fixed amplitude $r$ and a uniformly distributed phase $\phi$, we have
from (2.12.7) that, for the $n$th symbol, the observables $y_{mn}$ and $y_{m'n}$ are independent
with probability density functions

$$p(y_{mn} \mid \mathbf{x}_{mn}, r) = y_{mn}\, e^{-(y_{mn}^2/2 + r^2\mathscr{E}_s/N_0)}\, I_0\!\left( \sqrt{\frac{2\mathscr{E}_s}{N_0}}\, r\, y_{mn} \right)$$
$$p(y_{m'n} \mid \mathbf{x}_{mn}, r) = y_{m'n}\, e^{-y_{m'n}^2/2} \qquad m \text{ and } m' = 1 \text{ or } 2,\ m' \ne m \qquad (2.12.18)$$
[Figure 2.21: the encoder output symbols are distributed by an input commutator over a bank of shift registers of increasing length, transmitted through the modulator, channel, and demodulator, and restored to their original order by the converse deinterleaver bank; the interleaver input-deinterleaver output and interleaver output-deinterleaver input symbol orderings are indicated.]
Figure 2.21 Interleaving technique for memory elimination.
so that the latter is independent of r. Since r is a Rayleigh distributed variable with
parameter $\sigma^2$, we have, using (2.12.17)

$$p(y_{mn} \mid \mathbf{x}_{mn}) = \int_0^\infty \frac{r}{\sigma^2}\, e^{-r^2/2\sigma^2}\, y_{mn}\, e^{-(y_{mn}^2/2 + r^2\mathscr{E}_s/N_0)}\, I_0\!\left( \sqrt{\frac{2\mathscr{E}_s}{N_0}}\, r\, y_{mn} \right) dr = \frac{y_{mn}}{1 + \bar{\mathscr{E}}_s/N_0}\, \exp\left[ -\frac{y_{mn}^2}{2(1 + \bar{\mathscr{E}}_s/N_0)} \right]$$

Hence

$$p(y_{mn} \mid \mathbf{x}_{mn}) = \frac{y_{mn}}{1 + \bar{\mathscr{E}}_s/N_0}\, \exp\left[ -\frac{y_{mn}^2}{2(1 + \bar{\mathscr{E}}_s/N_0)} \right]$$
$$p(y_{m'n} \mid \mathbf{x}_{mn}) = y_{m'n}\, e^{-y_{m'n}^2/2} \qquad m \text{ and } m' = 1 \text{ or } 2,\ m' \ne m \qquad (2.12.19)$$
Examining first the case of trivial repetitive coding of two equiprobable mes-
sages, we have from (2.2.7) that the optimum decoder for equal prior probabilities
and average symbol energies, $\bar{\mathscr{E}}_s$, is

$$\hat{H} = H_m \qquad \text{if } \sum_{n=1}^{N} \ln \frac{p(y_{mn}, y_{m'n} \mid \mathbf{x}_{mn})}{p(y_{mn}, y_{m'n} \mid \mathbf{x}_{m'n})} > 0 \quad \text{for all } m' \ne m$$

which simplifies, according to (2.12.19), to

$$\hat{H} = H_m \qquad \text{if } \sum_{n=1}^{N} (y_{mn}^2 - y_{m'n}^2) > 0 \quad \text{for all } m' \ne m \qquad (2.12.20)$$
Given that message H,, was sent, we can calculate the error probability by finding
the distribution of the sum in (2.12.20), conditioned on x,,, from (2.12.19). It is
easily shown (Wozencraft and Jacobs [1965, chap. 7]) that this is a chi-square
distribution and that consequently the two-message repetition code error probabil-
ity is given by
$$P_E = p^N \sum_{j=0}^{N-1} \binom{N - 1 + j}{j} (1 - p)^j \qquad (2.12.21)$$

where

$$p = \frac{1}{2 + \bar{\mathscr{E}}_s/N_0} \qquad (2.12.22)$$
However, more insight can be drawn from deriving the Bhattacharyya upper
bound. From (2.3.15) and (2.12.19), we have

$$P_E \le \sum_{\mathbf{y}} \sqrt{p_N(\mathbf{y} \mid \mathbf{x}_m)\, p_N(\mathbf{y} \mid \mathbf{x}_{m'})} = \prod_{n=1}^{N} \int\!\!\int \sqrt{p(y_{mn}, y_{m'n} \mid \mathbf{x}_{mn})\, p(y_{mn}, y_{m'n} \mid \mathbf{x}_{m'n})}\; dy_{mn}\, dy_{m'n}$$
$$= [4p(1 - p)]^N \qquad (2.12.23)$$
where p is given by (2.12.22). It can be shown (Wozencraft and Jacobs [1965,
chap. 7]) that the ratio of the exact expression (2.12.21) to the bound (2.12.23)
approaches $[2\sqrt{\pi N}\,(1 - 2p)]^{-1}$ as $N \to \infty$ so that the bound is asymptotically
tight in an exponential sense. Finally, we write the bound as

$$P_E \le e^{-Nd} \qquad (2.12.24a)$$

where

$$d = -\ln [4p(1 - p)] \qquad (2.12.24b)$$
$$p = 1/(2 + \bar{\mathscr{E}}_s/N_0)$$
Both the decoding rule (2.12.20) and the error probability bound (2.12.24) can be
easily generalized to the case where the symbol energies are not equal (Wozencraft
and Jacobs [1965, chap. 7]). A most interesting conclusion can be drawn by
comparing (2.12.16) with (2.12.24a). Both cases deal with the transmission of a
single bit by one of two messages. Suppose the total average received energy is &.
Then in the first case $\bar{\mathscr{E}}_s = \bar{\mathscr{E}}$ and $P_E$ decreases only inversely with $\bar{\mathscr{E}}/N_0$. In the
second case, using the repetitive $N$-dimensional code, we have $\bar{\mathscr{E}}_s = \bar{\mathscr{E}}/N$ and

$$P_E \le e^{-Nd} \qquad (2.12.25)$$

where

$$d = -\ln \frac{1 + (\bar{\mathscr{E}}/N_0)/N}{[1 + (\bar{\mathscr{E}}/N_0)/2N]^2} \to \frac{(\bar{\mathscr{E}}/N_0)^2}{4N^2} \qquad \text{as } N \to \infty$$
Clearly then, making N very large degrades performance. However, we can readily
show that the maximum of Nd, the exponent of (2.12.25), occurs when
$$N \approx \frac{\bar{\mathscr{E}}/N_0}{3} \qquad (2.12.26)$$
in which case (2.12.25) becomes

$$P_E \Big|_{\text{opt}} \le e^{-0.149\, \bar{\mathscr{E}}/N_0} \qquad (2.12.27)$$
Comparing to the exact expression (2.12.9) for the noncoherent case and ignoring
the multiplicative factor of $\tfrac{1}{2}$ in the latter, we note that fading thus causes a loss of
about 5.25 dB in effective energy. More important, we conclude that, while repeti-
tive coding has no effect (either favorable or detrimental) for the coherent AWGN
channel, and while it has strictly a detrimental effect when the phase alone is
unknown, it can actually improve performance in the case of fading channels
provided that the dimensionality is chosen properly, the optimum being given
approximately by (2.12.26).
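The optimum diversity order can also be found numerically from (2.12.24b) and (2.12.25), as in the Python sketch below (an added illustration; the 20 dB operating point is an arbitrary example). The search confirms that the best exponent is close to $0.149\,\bar{\mathscr{E}}/N_0$, attained near $\bar{\mathscr{E}}_s/N_0 \approx 3$.

    import math

    def exponent(N, total_snr):
        # N*d from (2.12.24b)-(2.12.25) for N-fold diversity with E_s = E/N
        p = 1.0 / (2.0 + total_snr / N)
        return -N * math.log(4.0 * p * (1.0 - p))

    total_snr = 10 ** (20.0 / 10.0)          # average E/N_0 = 20 dB
    best_N = max(range(1, 200), key=lambda N: exponent(N, total_snr))
    print(best_N, exponent(best_N, total_snr) / total_snr)
    # best_N is close to (E/N_0)/3 and the normalized exponent is close to 0.149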
Finally, turning to nontrivial coding, we may again apply the union-Bhattac-
haryya bound as in Sec. 2.9. Then if a binary linear code is used, since it is obvious
from (2.12.19) that the channel is symmetrical, it follows that $P_{Em} = P_E$ for all $m$.
Then exactly as in (2.9.19) we have

$$P_E \le \sum_{k=2}^{M} e^{-w_k d} \qquad (2.12.28)$$

where $d$ is given by (2.12.24b) and $w_k$ is the Hamming weight of the $k$th nonzero
Of interest also is the effect of quantization. Clearly, the maximum likelihood
decoder output (2.12.20) may be quantized by quantizing the decoder symbol
output set $\{y_{1n}^2 - y_{2n}^2\}$ to any number of levels. In the simplest case of hard
two-level quantization (positive or negative), this reduces the fading channel to a
BSC with crossover probability given by (2.12.16). But this is exactly equal to the
parameter p defined by (2.12.22); and, for the BSC, we found in Sec. 2.9 that the
Bhattacharyya distance is
$$d_{\mathrm{BSC}} = -\ln \sqrt{4p(1 - p)} = -\tfrac{1}{2} \ln [4p(1 - p)] \qquad (2.12.29)$$
Thus, comparing with (2.12.24b), we find that for the fading channel, hard quantiza-
tion of the decoder outputs effectively reduces the Bhattacharyya distance by a
factor of 2 (3 dB). This is a more serious degradation than for the AWGN channel,
and is a strong argument for “soft” multilevel quantization (Wozencraft and
Jacobs [1965, chap. 7]).
2.12.4 Interleaving
With the exception of the AWGN channel, most practical channels exhibit statist-
ical dependence among successive symbol transmissions. This is particularly true
of fading channels when the fading varies slowly compared to one symbol time.
Such channels with memory considerably degrade the performance of codes
designed to operate on memoryless channels. The simplest explanation of this is
that memory reduces the number of independent degrees of freedom of the
transmitted signals. A simple example helps to clarify this point. Suppose a BSC
with memory makes errors very rarely, say on the average once every million
symbols, but that immediately after any error occurs, the probability of another
error is 0.1. Thus, for example, the probability of a burst of three or more errors is
one percent of the probability of a single error. Consider coding for this channel
using the (7, 4) Hamming single-error correcting code. If this were a memoryless
BSC so that errors occurred independently, the probability of error for each
four-bit seven-symbol codeword would be reduced by coding from approximately
$7 \times 10^{-6}$ down to approximately $3.5 \times 10^{-11}$. On the other hand, for the BSC
with memory as just described, the codeword error probability is reduced to only
about $6 \times 10^{-7}$. Coding techniques for channels with memory have been
proposed and demonstrated to be reasonably effective in some cases (Kohlenberg
and Forney [1968], Brayer [1971]; see also Secs. 4.9 and 4.10). The greatest problem
with coding for such channels is that it is difficult to find accurate statistical
models and, even worse, the channel memory statistics are often time-varying.
Codes matched to one set of memory parameters will be much less effective for
another set of values, as in the simple example above.
One technique which requires no knowledge of channel memory other than
its approximate length, and is consequently very robust to changes in memory
statistics, is the use of time-diversity, or interleaving, which eliminates the effect of
memory. Since in all practical cases, memory decreases with time separation, if all
the symbols of a given codeword are transmitted at widely spaced intervals and
the intervening spaces are filled similarly by symbols of other codewords, the
statistical dependence between symbols can be effectively eliminated. This inter-
leaving technique may be implemented using the system shown in Fig. 2.21. Each
code symbol out of the encoder is inserted into one of the $I$ tapped shift registers
of the interleaver bank. The zeroth element of this bank provides no storage (the
symbol is transmitted immediately), while each successive element provides j
symbols more storage than the preceding one. The input commutator switches
from one register to the next until the $(I - 1)$th after which the commutator
returns to the zeroth. J is the minimum channel transmission separation provided
for any two code symbols output by the encoder with a separation of less than
$J = jI$ symbols. For a block code, $J$ should be made at least equal to the
block length. The output commutator feeds to the channel (including the modula-
tor) one code symbol at a time, switching from one register to the next after each
symbol, synchronously with the input commutator. When the channel input is
not binary, it may be preferable to interleave signal dimensions rather than code
symbols. This is achieved, at least conceptually, by making each stage of the
registers a storage device for a signal dimension rather than a channel symbol
(easily implemented if each dimension contains an integral number of symbols).
It is easily verified that, for a natural ordering of input symbols $\ldots, v_i, v_{i+1},$
$v_{i+2}, \ldots$, the interleaver output sequence and hence the channel transmission
ordering is as shown in Fig. 2.21, where it is clear that the minimum separation
in channel transmission is at least J for any two code symbols generated by the
encoder within a separation of $J - 1$. This is called an $(I, J)$ interleaver.
The deinterleaver, which must invert the action of the interleaver, is clearly
just its converse. Observables are fed in with each dimension going to a different
shift register. Note, however, that to store the observables digitally, the channel
outputs must have been quantized. Hence, the deinterleaver storage must be
several times the size of the interleaver storage. For example, if the channel input
is binary, we require J(J — 1)/2 bits of storage in the interleaver. On the other
hand, with eight-level quantization at the channel (demodulator) output, each
output dimension contains 3 bits so that the storage required in the deinterleaver
is three times as great. We note also that the delay introduced by this interleaving
technique is equal to J(J — 1) symbol times.
The system of Fig. 2.21 represents a conceptually simple interleaving
technique, and it can be shown to be the minimal implementation of an $(I, J)$
interleaver in the sense of storage requirements and delay (Ramsey [1970]).
However, shift registers of varying lengths may be considerably more costly in
terms of numbers of required integrated circuits than, for example, a random-
access memory with appropriate timing and control to perform the functions of
the system of Fig. 2.21, even though the total storage of such a random-access
memory will be double that shown in this implementation. The main point to be
drawn from this discussion is that channels with memory can be converted into
essentially memoryless channels at a cost of only buffer storage and transmission
delay. This cost, of course, can become prohibitive if the channel memory is very
long compared to the transmission time per symbol.
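The interleaver-deinterleaver pair of Fig. 2.21 can be modeled compactly as a bank of first-in first-out delay lines of linearly increasing length, as in the Python sketch below (a simplified illustration with hypothetical parameters; it is not meant to reproduce the exact register lengths or notation of the figure).

    from collections import deque

    class ConvolutionalInterleaver:
        # Bank of I delay lines; line i delays its symbols by i*j commutator cycles.
        # The deinterleaver uses the complementary delays (I-1-i)*j, so interleaving
        # followed by deinterleaving restores the order after a fixed overall delay.
        def __init__(self, I, j, inverse=False, fill=0):
            delays = [((I - 1 - i) if inverse else i) * j for i in range(I)]
            self.lines = [deque([fill] * d) for d in delays]
            self.i = 0                          # commutator position
        def push(self, symbol):
            line = self.lines[self.i]
            line.append(symbol)
            out = line.popleft()
            self.i = (self.i + 1) % len(self.lines)
            return out

    # Round-trip check with hypothetical parameters I = 4, j = 3
    I, j = 4, 3
    tx = ConvolutionalInterleaver(I, j)
    rx = ConvolutionalInterleaver(I, j, inverse=True)
    data = list(range(40))
    out = [rx.push(tx.push(s)) for s in data]
    delay = I * (I - 1) * j                     # total end-to-end delay in symbols
    assert out[delay:] == data[:len(data) - delay]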
2.13 BIBLIOGRAPHICAL NOTES AND REFERENCES
The first half of this chapter, through Sec. 2.8, owes much of its organization to the
text of Wozencraft and Jacobs [1965], specifically chaps. 4 and 5. This text
pioneered in presenting information-theoretic concepts in the framework of prac-
tical digital communication systems. We have deviated by presenting in Secs. 2.4
and 2.5 the more sophisticated upper bounds due to Gallager [1965] and Fano
[1961] to establish the groundwork for the more elaborate and tighter bounds of
successive chapters.
Sections 2.9 and 2.10 are, in part, standard introductory treatments of linear
codes. The proof of the uniform error property for linear codes on binary-input,
output-symmetric channels is a generalization of a proof of this property for the
BSC due to Fano [1961]. The evaluation of error probabilities and bounds for
specific linear codes on channels other than the BSC carried out in Secs. 2.9 and
2.11 is scattered throughout the applications literature. Section 2.12 follows for the
most part the development of chap. 7 of Wozencraft and Jacobs [1965]. The
interleaving technique of Fig. 2.21 is due to Ramsey [1970].
APPENDIX 2A GRAM-SCHMIDT ORTHOGONALIZATION AND
SIGNAL REPRESENTATION
Theorem Given M finite-energy functions {x,,(t)} defined on [0, T], there exist
N <M unit-energy (normalized) orthogonal functions {@,(t)} (that is, for
which $\int_0^T \phi_n(t)\, \phi_k(t)\, dt = \delta_{nk}$) such that

$$x_m(t) = \sum_{n=1}^{N} x_{mn}\, \phi_n(t) \qquad m = 1, 2, \ldots, M \qquad (2.1.1)$$
where for each m and n
$$x_{mn} = \int_0^T x_m(t)\, \phi_n(t)\, dt$$
Furthermore, N = M if and only if the set {x,,(t)} is linearly independent. The
$\{\phi_n(t)\}$ are said to form a basis for the space generated by the set of functions
$\{x_m(t)\}$.
PROOF Let $\mathscr{E}_1 = \int_0^T x_1^2(t)\, dt$. Define the first normalized basis function

$$\phi_1(t) = x_1(t)/\sqrt{\mathscr{E}_1} \qquad (2A.1)$$

Then clearly

$$x_1(t) = \sqrt{\mathscr{E}_1}\, \phi_1(t) = x_{11}\, \phi_1(t) \qquad (2A.2)$$

where $x_{11} = \sqrt{\mathscr{E}_1}$ and $\phi_1(t)$ has unit energy as required. Before proceeding to
define the second basis function, define $x_{21}$ as the projection of $x_2(t)$ on $\phi_1(t)$,
that is

$$x_{21} = \int_0^T x_2(t)\, \phi_1(t)\, dt \qquad (2A.3)$$

Now define

$$\phi_2(t) = \frac{x_2(t) - x_{21}\phi_1(t)}{x_{22}} \qquad (2A.4)$$

where

$$x_{22} = \sqrt{\int_0^T [x_2(t) - x_{21}\phi_1(t)]^2\, dt} \qquad (2A.5)$$
It then follows from (2A.3) and (2A.4) that

$$\int_0^T \phi_1(t)\, \phi_2(t)\, dt = 0 \qquad (2A.6)$$

and from (2A.4) and (2A.5) that $\phi_2(t)$ has unit energy since

$$\int_0^T \phi_2^2(t)\, dt = \frac{\int_0^T [x_2(t) - x_{21}\phi_1(t)]^2\, dt}{x_{22}^2} = 1 \qquad (2A.7)$$

Also, from (2A.4), we have

$$x_2(t) = x_{21}\, \phi_1(t) + x_{22}\, \phi_2(t) \qquad (2A.8)$$

and from (2A.6) and (2A.7) it follows that

$$x_{22} = \int_0^T x_2(t)\, \phi_2(t)\, dt$$
We now proceed to generalize (2A.2) and (2A.8) to the $m$th function $x_m(t)$,
by induction. Suppose that for all $k < m$

$$x_k(t) = \sum_{n=1}^{k} x_{kn}\, \phi_n(t) \qquad k = 1, 2, \ldots, m - 1 \qquad (2A.9)$$

where

$$x_{kn} = \int_0^T x_k(t)\, \phi_n(t)\, dt \qquad (2A.10)$$

and where the $\{\phi_n(t),\ n = 1, 2, \ldots, k\}$ are mutually orthogonal and each has
unit energy. Then define

$$x_{mn} = \int_0^T x_m(t)\, \phi_n(t)\, dt \qquad n = 1, 2, \ldots, m - 1 \qquad (2A.11)$$

and

$$\phi_m(t) = \frac{x_m(t) - \sum_{n=1}^{m-1} x_{mn}\, \phi_n(t)}{x_{mm}} \qquad (2A.12)$$

where

$$x_{mm} = \sqrt{\int_0^T \Bigl[ x_m(t) - \sum_{n=1}^{m-1} x_{mn}\, \phi_n(t) \Bigr]^2 dt} \qquad (2A.13)$$
It follows from (2A.11) and (2A.12) that

$$\int_0^T \phi_m(t)\, \phi_n(t)\, dt = 0 \qquad \text{for all } n < m \qquad (2A.14)$$

and from (2A.12) and (2A.13) that $\phi_m(t)$ has unit energy. Reordering (2A.12),
we have

$$x_m(t) = \sum_{n=1}^{m} x_{mn}\, \phi_n(t) \qquad (2A.15)$$

and from (2A.14)

$$x_{mm} = \int_0^T x_m(t)\, \phi_m(t)\, dt \qquad (2A.16)$$
It thus follows that, for M finite-energy functions {x,,(t)}, the representation
(2.1.1) is always possible with N no greater than M.
Suppose, however, that a subset of these functions is linearly dependent;
i.e., that there exists a set of nonzero real numbers a,, a2, ..., a; for which
A, Xm,(t) + 42 Xm,(t) + °° + 4;Xm,(t) = 0
where m, <m, <°*: < mj.
In such an event, it follows that x,,,(t) can be expressed as a linear combi-
nation of X,,,(t)***Xm,_,(t) and thus as a linear combination of the basis func-
tions which generate these previous signal functions. As a result, it is not
necessary to generate a new basis function @,, (t) in order to add x,,(t) to the
set of represented functions. In this way, one (or more) basis functions may be
omitted and hence N < M. It should be clear that a basis function can be thus
skipped if and only if the set {x,,(t)} is not linearly independent.
PROBLEMS
2.1 (a) For the 16-signal set shown in Fig. P2.1a, transmitted over the AWGN channel, with equal a
priori probabilities, determine the optimum decision regions, and express the exact error probability in
terms of the average energy-to-noise density ratio.
(b) Repeat for the tilted signal set shown in Fig. P2.1b.
Mosse MB Me PK a x
X X3V2a
X Xa+ xX X X x x
a + x x a
—3q cae: a 3a <->
X aie gar X X X X
i V2a
X x
x X + X x X
—3a
(a) (b)
120 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
2.2 For the seven-signal set shown, transmitted over the AWGN channel, with equal a priori
probabilities
(a) Determine the optimum decision regions.
(b) Show that one can obtain an upper bound on P,, m= 1, 2,..., 7, and hence on P,, by
calculating the probability that the norm of the two-dimensional noise vector is greater than J6/2,
and calculate this bound.
x1 x
Figure P2.2
2.3 For the signal set of Prob. 2.2, obtain a union bound on P,. of the form of (2.3.4) for each m.
Compare the resulting bound on P, with that obtained in Prob. 2.2.
2.4 For the orthogonal signal set of M equal-energy signals transmitted over the AWGN channel, first
treated in Sec. 2.3
(a) Show that the error probability is given exactly by
P, = Py, = 1— Pr {y,, < y; for all m $ 1|x,}
where the {y,,} are the M observables.
(b) From this, derive Eq. (2.11.1).
(c) Letting & = &, log, M, where &, is the energy/bit, show that
aa of i za M1 ‘ if 6,/N, <|In2
1 _ —- =
basi DAE Np 1 ifé,/N,>In2
and consequently that lim P; is 0 if &,/N, > In 2 and is 1 if the inequality is reversed.
Hint: Use L’Hospital’s rule on the logarithm of the function in question.
2.5 (a) Show that, if M = 2*, an orthogonal signal set of M dimensions can be generated for any
integer value of K by the following inductive construction. For K = 1, let
oa oH where H, =
oe ES ie 1S pies ee
Xy
X2 é Ax- Ax-
= |»—H here Hx =
21 8 cree ii Mie Hy
Xox
Then for any integer K > 2
(b) Note that, for this construction, the first component of each signal vector is always equal to
+ ./&/M. Consider deleting this component in each vector, thus obtaining a signal set {x,} with M — 1
dimensions and normalized inner products among all vectors
1
—1
7 Bi &) = for all j #k
M-1
CHANNEL MODELS AND BLOCK CODING 121
where &’ = &(M — 1)/M, which is the signal energy after deletion of the first component. This new
signal set is called a regular simplex signal set.
(c) Show that P, for the regular simplex signal set is identical to that of orthogonal signals as
given in Prob. 2.4, but since the energy has been reduced in the simplex case
é' 1 &'M
Pi, + 4.1 of, 6
NMI N,(M — 1)
where the first parameter indicates the energy-to-noise density and the second gives the common
normalized inner product among all signal vectors.
(d) Show that, for any set of M equal-energy signals, the average normalized inner product
> d (%;, %&) =
€M(M — 1) 444
Pay =
and hence the set generated in (b) achieves the minimum.
(e) Generalize the argument used in (c) to show that, if all normalized inner products are equal to
p > —1/(M — 1), then
olson
2.6 (a) Show that an ideal lowpass filter with transfer function
{1
| — if Ww
Ho) = |W if |wa| <x
0 otherwise
has noncausal impulse response
sin nWt
h(t) =
mWt
(b) Show that, in response to a signal z(t), the response of this lowpass filter at time n/W will be
| 20h (=) —t
(c) Show then that the mechanization of Fig. 2.11 is equivalent to that of Fig. 2.9 with finite-time
integrators replaced by infinite-time integrators.
2.7 (a) Suppose that a signal set utilizes the basis functions of Table 2.1, Example 1, but that at the
receiver the frequency and phase are incorrectly known so that the function
” sin [xW(t — n/W)]
dt = [ z(t)
+5 mW(t — n/W) =
$,(t) = ./2N/T sin [(@o + Aw)t+ @] (n—1)T/N<t<nT/N
is used. Assuming @) T/N > 1 and AwT/N « 1, show that the observables are attenuated approxi-
mately by the factor
; (=) (=)
cos @ sin | —— } /|——
N N
(b) For the basis functions of Table 2.1, Example 2, assume w is known exactly at the receiver
but that the phase is incorrectly assumed to be @. Show that, if @) T/N > 1, the signal components
of the observables become approximately
Von = X2n cos wy) + Xon+1 sin og
Von+1 Ny sin ty) + X2n+1 COS g
122 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
(c) For part (b), let M = 4and N = 2 (quadriphase transmission). Show how the decision regions
are distorted by the incorrect ¢, and obtain expressions for the resulting error probabilities.
2.8 (a) For the signal set of Fig. 2.12b transmitted over the AWGN channel, with equal a priori
probabilities, determine the optimum decision regions.
(b) Show that for all m
P< Py = 2P
where P = Q(./2&,/N, sin (2/16)).
(c) Compare this lower bound on P, with the exact expression for P, of the signal set of
Fig. 2.12a (Prob. 2.1), and thus determine which set is superior in performance for equal average
energies.
2.9 (a) For the binary input AWGN channel with octal output quantization shown in Fig. 2.14 obtain
explicit expressions for the transition probabilities
k=1,2
b.
P(b;| 4) f= 1 2,.5:,8
(b) Give the optimum decision rule in as compact a form as possible.
2.10 (Chernoff Bound)
(a) Let z be a random variable such that its distribution (density) p(-) has finite moments of all
order. Show that
Pr (z>0)= Y' p(z) < df (2)p(z) = Eff (2)}
z>0
where
1 z>0
fQ2\) J 20
(b) Choose f(z) = e??, p > 0, and thus show that Pr (z > 0) < Efe’?], p > 0.
(c) In (2.3.12), let z(y |x,,) = In [py(y|X,1-)/Py(y|X,)] where y has distribution (density) p,(y |x,,).
Using (b), show that
P,(m— m) < E,[Py(¥ |Xmw)/Pul¥ [Xm]? = 2. Pwl¥ [Xm ?Pwl¥ |X)? = p =O
(d) Show that the bound in (c) reduces to the Bhattacharyya bound when p = 1/2.
(e) Consider the asymmetric binary “Z” channel specified by
Let x,, = 00... 0 and x,, = 11... 1 be complementary N-dimensional vectors. Show that the Chernoff
bound with p optimized yieids
P,(m—m’) < p*
and show that this is the exact result for maximum likelihood decoding. Compare with the Bhattac-
haryya bound.
2.11 (a) Show that the code whose parity-check matrix is given in Fig. 2.17 has the generator matrix
Poe a eee
Gul oe ee ee
SOc Ae ee
OC 4 36:01 53-4 F
CHANNEL MODELS AND BLOCK CODING 123
(b) Generalize to obtain the form of G for any (L, K) Hamming single-error correcting code
where L = 24-* — 1, L— K >2.
(c) Show that, for all the codes in (b), d,., = 3.
2.12 (a) For binary orthogonal codes, show that the expected number of symbol errors y occurring on
a BSC defined by hard quantizing an AWGN channel is
2é,K
E(n) = Lp = Lo S| where L = 2*
and that the variance is var [y] = Lp(1 — p).
(b) For large K and L = 2*, show that
E{n) = Lp = ie fs -(Ka-*)| +5
var [y] = Lp(1 — p) = L/4 as K > 00
and that
(c) Since d,,;, = L/2 for the codes of (a), show that the bound (2.10.14) can be expressed as
Py < Pr {n > L/4}.
(d) Using (b) and the Chebyshev inequality show that
>. 4| < 4 = 2~(-2)
eee | Sea 2
Pr
L
I~ >
Thus, Pr {yn < L/4}-+0 as K > and consequently the bound of (c) approaches unity.
2.13 Consider the following normalized four-level quantizer used with the AWGN channel
—2 | -1 | +1 | +42 26
pre 0 a on No
(a) Show that the resulting binary-input quaternary-output channel is symmetric with transition
probabilities
Po(—2) = Q(x + a) = p,(+2)
Po(— 1) = Q(x) — Q(x + a) = p,(+1)
Po( + 1) = Q(x — a) — Q(x) = ps(—1)
Po( +2) = 1 — Q(x — a) = pi (—2)
(b) Evaluate
d=-—In 2[./ Po( +2)Po(—2) > J Po( + 1)po(— 1)]
and optimize for &,/No = 2.
2.14 Generalize (2.12.8) to noncoherent detection of M orthogonal signals.
(a) Show that
P,, =1—Pr {y, > y; for all i # 1|x;,}
ame dy,
v1 Mf
| Plvalss) dya|
0
- [ posls
where p(y, |x,) and p(y,|x,) are given by (2.12.7).
(b) Substitute as justified in (a) to obtain
P;, = | Py; |x,)[1 —(f- C7 dy,
0
124 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
(c) Show that the integral in (b) reduces to the finite sum
=
P, = P,, = e78!Ne ¥ (—1p a of iNo)
j=2 M J
2.15 (Continuation and Bound)
(a) Show that the term in brackets in Prob. 2.14(b) is upper bounded by
1—(1—e°”/?)M~! < min [(M — l)e"””?, 1] <[(M—-l1)e-”"??)_ O<p<1
(b) Use this to show that
P, <(M — 1)? exp -< (+) 0<p<l
which is the same as the bound (2.5.12) for coherent detection and leads directly to the exponential
bound (2.5.16).
(c) Give an intuitive argument for the perhaps unexpected result of (b).
2.16 Consider a binary linear block code with K = 4, L= 7 and generator matrix
1s Se
rie is Sk | Rs,
oo 4 | ae
0 0 0 j ae
(a) Find a parity-check matrix H for this code.
(b) Suppose we use this code over a BSC and the received output isy = (110111 1). What is the
maximum likelihood decision for the transmitted codeword ?
(c) Repeat (b) for y= (100110 1).
(d) What is the minimum distance of this code?
2.17 Consider M completely known, orthogonal, time-limited, equal-energy, equally likely signals
Kt: 3 Xe) where
r
| xWd)x(t) dt = 66,
0
These signals are used for digital communication over the usual additive white Gaussian noise channel
with spectral density N,/2. Consider a receiver that computes
”) yi
= Feral y(t)x,(t)dt k=1,2,...,M
and decides m,, when A,, = max, {A,} provided that max, {A,} > 6. If 4, < 6 for all k, then the receiver
declares an erasure and does not make any decision. Let 6 > b = ./26/No.
(a) Find the probability of an erasure.
(b) Find the probability of a correct decision.
2.18 Consider detection of a signal of random amplitude in additive white Gaussian noise such that
Hy = y(t) = n(t) C<re F
H, = y(t) =x@(t)+n(t) O<t<T
where
E{n(t)n(t + t)} = 6(t) and [ oC) dt =1
and x is a Gaussian random variable with zero mean and unit variance. What is the minimum average
error probability when both hypothesis H, and H, have a priori probability of 5?
CHANNEL MODELS AND BLOCK CODING 125
2.19 Consider the three signals
0 elsewhere
k.= 0, 1, 2
to be used to send one of three messages over an additive white Gaussian noise channel of spectral
density N,/2.
(a) When the messages are equally likely, show that the minimum probability of error is given by
rel ae fnals
(b) Find the minimum probability of error when the a priori probabilities are
da
To = Pr {mz is sent}
ge
2
m, = Pr {m, is sent} = 4
™ = Pr {m, is sent} = 0
(c) Find the minimum probability of error when
Ko = © n, =4 1, =3
2.20 Consider the detection of two equally likely signals in additive colored Gaussian noise where
E[n(t)n(s)] = $(¢, s)
H, — y(t) = x,(t) + n(¢)
H, y(t) = x,(t) + nl?) soe
Suppose that the functions w,, W,, ..., W,, and the constants 07, 03, ..., 02 satisfy the equation
T
| o(t, sb,(s) ds =o2y,(t) O<t<T k=1,2,...,m
0
where
x
[ ve) at = 6,
0
Suppose that the signals are
x,(t)= ¥ Ve Y(t)
k=1
eee -¥ Va v,(t)
and let
y= | vlevale) dt k= 1,2, .:.,m
126 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
(a) Show that the minimum-probability-of-error decision rule is
oe Vé,
if yx (-=-]>0 choose H,
k=1 oj,
Otherwise choose H,
(b) Using the optimum decision rule, find an exact expression in terms of Q(-) for the error
probability as a function of 07, 03, ...,02, and &,, &,..., &,- Check your answer for the special case
where
N
of = — Sa Ne peers
2
with 6 = &,+&,+°°:'+6,, denoting the total signal energy.
2.21 [Staggered (Offset) QPSK and Minimum Shift Keying (MSK)]|
Consider the signal set generated by binary modulation (x, = +1 for each k) of the basis vectors
$2,(t) = : fle — (n — 3)1] sin wot (n—1)0<t<nt
0
otherwise
2
pt) 7 f(t — nt) cos wot (n—4)t<t<(n+4)t
0 otherwise
ic ; 3
t= N Wo is a multiple of 2x/t
(a) [Staggered (Offset) QPSK (SQPSK)]
Let
oe —t/2<t<t/2
al otherwise
f(t)
Show that the performance with optimum demodulation is the same as for QPSK, and that the
spectral density of the modulation sequence A }, x; ?;(t) for a random binary sequence {x,} is the same
as for QPSK:
S(w) = 3[S,(@ — @) + S,(@ + @)]
where
ay = =)
@T
(b) Comparing (a) with Prob. 2.7(b), show that the cross-channel interference effect of the phase
error @ is reduced relative to ordinary QPSK.
(c) [Minimum Shift Keying (MSK)]
Let
Tt
ee } V2 cos — —1/2<t<t1/2
0
otherwise
Show that the performance is the same as for QPSK and SQPSK with optimum demodulation.
(d) For MSK, show that for random binary modulation in the interval (n — 4)t < t < nt/2
CHANNEL MODELS AND BLOCK CODING 127
the signal can be expressed as
Xan Pan(t) + X2n+1 P2neilt) = +(2/,/z) cos [(@ + 2/t)¢]
which amounts to continuous phase frequency shift keying.
(e) Show that the spectral density of MSK can be expressed in the form given in (a) but with
sum (5) el
which decreases for large frequencies as w~* rather than w~? as is the case for QPSK and SQPSK.
CHAPTER
THREE
BLOCK CODE ENSEMBLE PERFORMANCE
ANALYSIS
3.1 CODE ENSEMBLE AVERAGE ERROR PROBABILITY:
UPPER BOUND
In Chap. 2 we made only modest progress in evaluating the error performance of
specific coded signal sets. Since exact expressions for error probability involve
multidimensional integrals which are generally prohibitively complex to calculate,
we developed tight upper bounds, such as the union-Bhattacharyya bound (2.3.16)
and the Gallager bound (2.4.8), which are applicable to any signal set. Never-
theless, evaluation of these error bounds for a specific signal set, other than a few
cases such as those treated in Sec. 2.11, is essentially prohibitive, and particularly
so as the size of the signal set, M, and the dimensionality, N, become large. It
follows that, given the difficulty in analyzing specific signal sets, the search for the
optimum for a given M and N is generally futile.
Actually, the exit from this impasse was clearly indicated by Shannon [1948],
who first employed the central technique of information theory now referred to,
not very appropriately, as “random coding.” The basis of this technique is very
simple: given that the calculation of the error probabilities for a particular set of
M signal (or code) vectors of dimension N is not feasible, consider instead the
average error probability over the ensemble of all possible sets of M signals with
dimensionality N. A tight upper bound on this average over the entire ensemble
128
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 129
turns out to be amazingly simple to calculate. Obviously at least one signal set
must have an error probability which is no greater than the ensemble average;
hence the ensemble average is an upper bound on the error probability for the
optimum signal set (or code) of M signals of dimensionality N. Surprisingly, for
most rates this upper bound is asymptotically tight, as we shall demonstrate by
calculating lower bounds in the latter half of this chapter.
To begin the derivation of this ensemble upper bound, consider a specific code
or signal set’ of M signal vectors x,, X>, ..., Xj, each of dimension N. Suppose
there are Q possible channel inputs so that x,,,, € X = {a,, d2,..., dg}, and m = 1,
2,...,M;n= 1, 2, ..., N. As discussed in Sec. 2.7, these inputs may be taken as
amplitudes, phases, vectors, or just as abstract quantities. In any case, this ensures
that there are in all exactly Q”” possible distinct signal sets with the given par-
ameters, some of which are naturally absurd such as those for which x; = x; for
some i + j. Nevertheless, if Pe (x;, X2,..., X,) is the error probability for the mth
message with a given signal set, the average error probability for the mth message
over the ensemble of all possible Q” signal sets is
Py = Pex oan DD 2 Pen X25 +++» x m) i fash NS Meee | (3.1.1)
Xi X2
where each of the M summations runs over all Q* possible N-dimensional Q-ary
vectors from x = (a,, a, ..., a,) to x = (ag, ..., dg). Hence the M-dimensional
sum runs over all possible Q”™ signal sets and we divide by this number to obtain
the ensemble average.
For the sake of later generalization, we rewrite (3.1.1) as
=> asi 2 n(X1)4n(X2) -- Qn(Xu)Pe,(X1; Nn ic wy aad
X1 X2
moe 82M (3.1.2)
where qy(x) will be taken as any distribution over 2; for now, however, we
continue with the uniform weighting of (3.1.1) and take
dts) = of ee 17 oa AE (3.13)
In Sec. 2.4 we derived an upper bound on P,, for any specific signal set, namely
p
PEIN; Bs +s: Sie) Se Pal hed, Pa aap p>0
y m’' =m
(3.1.4)
This Gallager bound is more general than the union-Bhattacharyya bound of
Sec. 2.3 to which it reduces for p = 1. Initially, for the sake of manipulative
simplicity, we consider only m = 1. Then inserting (3.1.4) into (3.1.2) and changing
' Throughout this chapter we shall use the pairs of terms code and signal set, and code vector and
signal vector, interchangeably.
130 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
the order of the summations, we obtain as the upper bound on ensemble error
probability when message 1 is sent
P;, < >; > an( X1)Pw(y |x1) Benes ly Patt 2 Qn (X2)n(X3) °** Gn(Xm)
eet x2 x3
3 t mig PSU, 12.15)
m’'=2
To proceed further, we must restrict the arbitrary parameter p to lie in the unit
interval 0 < p < 1. Then limiting attention to the term in braces in (3.1.5) and
defining
M
fn (X2, eeey Xu) = 3 ply xfer ** 0 = p < 1 (3.1.6)
m’=2
we have from the Jensen inequality (App. 1B)
3 eS 2 Gn(X2)4n(X3) --- dv(Xm)L fv (X25 «++» Xu)]?
X2 =X3
p
nis PY ie 2 Gn (X2)4n(X3) --- Qv(Xm) fv (X25 ---» Xm) (3.1.7)
X2 X3
since f§ is a convex - function of f for all x when 0 < p < 1. Here qy(x) > 0 and
¥ ay(x) = 1 (3.1.8)
Next, using (3.1.6), we can evaluate the right side of (3.1.7) exactly to be
> oo 2, Qn(X2) *** Qn(Xm)fy(X2 *** Xm)
p
2 :
= [SoS aula) “ante pvt neat
| X2 XM m =
act :
om 3 5 aul Pv(y Xm)
im’! =2 xm’
ve 1) anbodnaty xy[ (3.1.9)
where the last step follows from the fact that each vector x,,, is summed over the
same space 2. Combining (3.1.5) through (3.1.7) and (3.1.9) and recognizing
that, since the factors of the summand of (3.1.5) are nonnegative, upper bounding
any of them results in an upper bound on the sum, we have
Pg, < (M — 1) © Y an(X1)pyly |x.) + » ancxontyixytr*
y Xi
ishp
=(M=1PE[Eavtooniyl*]” O<pst G10)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 131
Now if, for any m + 1, we were to interchange the indices 1 and m throughout the
above derivation from (3.1.5) on, we would arrive at the same bound which is
consequently independent of m. Thus, trivially over-bounding M — 1 by M, we
obtain finally the following upper bound on the ensemble average error probabil-
ity when any message m is sent
it+p
Py, <M? Y |¥ an(x)pw(y |x)! *” O<p<1 m=1,2,...,M
y x
(3.1.11)
This bound is valid for any discrete (Q-ary) input and discrete or continuous output
channel, provided in the latter case we replace the summation over Y, by an
N-dimensional integral and take py(- ) to be a density function. It is also note-
worthy that the steps followed in deriving (3.1.11) are formally similar to those
involved in the derivation of P, for orthogonal signals over the AWGN channel in
Sec. 2.5. This similarity will become even more striking in the next section.
Note, however, that we have not yet restricted the channel to be memoryless. If
we So restrict it, we have
Py(y |x) = Fhe) (3.1.12)
If we also restrict gy(x) to be a product distribution
qu(x) = ats) (3.1.13)
[which is trivially true for the special case (3.1.3) in which q(x) = 1/Q] then upon
inserting (3.1.12) and (3.1.13) in (3.1.11), we have for a memoryless channel that
P,. <M? yo hee ~ YY ¥ alx1)p(yi |x)!” sig
Ji: J2 X1 X2 XN
l+p
1/(1+ p)
x q(Xy)P(Yn | xy)
eid
= Mil Y aber)o(vs xe]
Ap ae at
x 5 > 4xn)P(yn | xy) *” 4
YN LXN
1+p\N
O<p<l (3.1.14)
a asly sy")
_ mrly.
where p(y|x) is the symbol transition probability (density). [In the special case
(3.1.3), q(x) = 1/Q for all x e 2]
Before proceeding to evaluate the consequences of the elegantly simple result
(3.1.14), let us generalize it slightly. We began in (3.1.1) and (3.1.2) by taking a
132 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
uniform average over the entire ensemble of possible coded signal sets. However,
for some signal sets and for some channels, it will develop that certain choices are
preferable to others. In evaluating an average where the ultimate goal is to bound
the performance of the best member of the ensemble, it is logical that, based on
some side information or intuition, we might wish to weigh certain sets of signal
vectors (or certain signal vectors, or certain symbols or components of signal
vectors) more heavily than others. An appropriate, though banal, example would
be to use the average test score of a class of students to lower bound the score of
the best student. However, if an instructor’s experience is that red-haired, green-
eyed students generally perform above average and green-haired, red-eyed stu-
dents perform below average, he may choose to use a weighted average which
weighs the score of any student from the first group most heavily, that of any
student from the second group least heavily, and that of any other student some-
where between the two extremes. The only constraint is that the sum of the
(nonnegative) weights be unity or, equivalently, that the vector of weights be a
distribution vector. If the instructor’s bias is justified, this weighted average will
then be a tighter lower bound on the performance of the best student than the
original uniform average, but it will always be a valid lower bound regardless of
the validity of bias.
We can easily achieve such a priori biasing from (3.1.2) on by allowing qy(x)
to be any distribution on the Q” possible signal vectors. Thus (3.1.2) may be
regarded as a weighted ensemble average where the weighting of the signal sets,
which are members of the ensemble, are given by the product measure
[ [v=1 ay(Xn). The same may be said of all subsequent ensemble averages through
(3.1.11). For a memoryless channel, defined by (3.1.12), we further restrict this
arbitrary weighting to be of the form (3.1.13) which corresponds to weighting each
component of each codeword independently according to q(x). For many classes
of channels, including all binary-input, output-symmetric channels, a nonuniform
weighting does not reduce the bound on the ensemble average error probability.
For others such as the Z channel (Prob. 3.1), there is a marked improvement at
some rates. And clearly, if by nonuniform weighting of the members of the en-
semble we manage to reduce this average, then the best signal set must perform
better than this newly reduced average. The advantage of nonuniform weighting
depends generally on the skewness of the channel.
We may express (3.1.14) alternatively in terms of the data rate per dimension
In M
Raves
nats/dimension (3.1.15)
which is of course related to the rate R,; in nats per second defined in Sec. 2.5 by
R=R,(T/N)
= R;./2W (3.1.16)
Thus, since M = e’®, we obtain for memoryless channels
Py < eta @ eR Og pica (3.1.17)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 133
where?
1+p
E,(e, q) = —In >) |>) a(x)p(y| x)" *? (3.1.18)
and where q = {q(a;), q(a2), ..., q(ag)} is an arbitrary distribution vector; that is, q
is an arbitrary vector over the finite space # = {a,,a2,..., ag} with the properties
q(x)>0 + foreveryxe
and
bh Fee (3.1.19)
We observe finally that, since (3.1.17) is an error probability bound for any
message sent, it must also be a bound on the ensemble average of the overall error
probability P, no matter what the message prior probabilities may be, provided the
maximum likelihood decision rule is used. Also, since p is arbitrary within the unit
interval and q is an arbitrary distribution vector subject to the constraints (3.1.19),
we may optimize these parameters to yield the tightest upper bound. This is
achieved, of course, by maximizing the negative exponent of (3.1.17) with the
result that the average error probability over the ensemble of all possible signal
sets for a Q-ary input memoryless channel may be bounded by
Poe pe (3.1.20)
where
E(R) = max max [E,(p, q) — pR]
q O<p<1
E,(p, q) is given by (3.1.18) and q is a distribution vector subject to the constraints
(3.1.19). It obviously follows that at least one signal set in the ensemble must have
P, no greater than this ensemble average bound.
We leave the detailed discussion of this remarkably simple result to the next
section where we utilize it to prove Shannon’s channel coding theorem.
3.2 THE CHANNEL CODING THEOREM AND ERROR
EXPONENT PROPERTIES FOR MEMORYLESS CHANNELS
The key to assessing the value of the bound on the ensemble average error
probability given by (3.1.20) lies in determining the properties of the function
E,(p, q) given by (3.1.18). The important properties of this function that depend
only on the memoryless channel statistics {p(y|x)} and the arbitrary input
weighting distribution q(-) are summarized in the following.
> The function E,(p, q) appears in other bounds as well. It was first defined by Gallager [1965] and
is referred to as the Gallager function.
134 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
Lemma 3.2.1 (Gallager [1965]) Let
Ee.a)= — nS |S aextryr
where q(-) is a probability distribution over the finite space Y = {a;, az, ...,
ag}, and suppose that?
= 3 ) q(x)p(y |x) In y 2 h05 #0 (3.22)
is nonzero. Then the function E,(p, q) has the following properties:
E,p,q)=>0 p2d
b
EM, GPSOnco et Sp se ae
with equality in either case if and only if p = 0; and
0E,(p, 4)
om aura - 2.4
ap >O0 p>-tl (3.2.4)
0°E,(p, 4)
ia tege <0 p>-l (3.2.5a)
with equality in (3.2.5Sa) if and only if
etal ey er (3.2.5b)
In ’ ’ ee
» a(x')p(y|>’)
for all x e #, ye Y such that q(x)p(y|x) > 0.
In (3.2.2) and (3.2.5b) we find the function I(q), called the average mutual
information of the channel, first defined in Sec. 1.2* where it was shown to be
nonnegative. Direct substitution of p = 0 in (3.2.1) shows that E,(0, q) = 0 and
hence that the inequalities (3.2.3) follow from (3.2.4); the proof of inequalities
(3.2.4) and (3.2.5a) is based on certain fundamental inequalities of analysis.
Appendix 3A contains these inequalities and gives the proof of (3.2.4) and (3.2.5a).
Thus, in all cases except when the condition (3.2.5b) holds, E,(p, q) is a posi-
tive increasing convex - function, for positive p, with a slope at the origin equal to
3 I(q) = I(%; Y) was first defined in Sec. 1.2. Henceforth, the channel input distribution is used as
the argument, in preference to the input and output spaces, because this is the variable over which all
results will be optimized.
* Note that average mutual information here evolves naturally as a parameter of the error probabil-
ity bound,.while in Sec. 1.2 it was defined in a more abstract framework.
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 135
Eo (p, q)
te Slope = J(q)
p Figure 3.1 Function E,(p, q).
I(q). An example is sketched in Fig. 3.1. On the other hand, if (3.2.5b) holds, the
second derivative of E,(p, q) with respect to p is zero for all p, and consequently in
this case E,(p, q) = pI(q). While it is possible to construct nontrivial examples of
discrete channels for which (3.2.5b) holds (see Prob. 3.2), these do not include any
case of practical importance.
Then restricting our consideration to the case where (3.2.5a) is a strict inequal-
ity, we have that, for any particular distribution vector q, the function to be
maximized in (3.1.20), [E,(p, q) — pR], is the difference between a convex - func-
tion and a straight line, and hence must itself be convex - for positive p as shown
in Fig. 3.2. Defining
E(R, q) = max [E,(p, q) — pR] (3.2.6)
O<p<i
Eo(e, q)— pR
p
0 1
(a) R < 0Eo(0, q)/0p|,-1
Eo (p, q) rae pR
p
0 ae
(b) R = dE (0, q)/8p|,-,, > 9Eo (0, )/9p|,-1 Figure 3.2 Function E,(p, q) — pR.
136 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
we note that, as a consequence of Lemma 3.2.1, the exponent has a unique maxi-
mum. For small R (Fig. 3.2a), the maximum of [E,(p, q) — pR] occurs for p > 1
and, consequently, the maximum on the unit interval lies at p = 1. For larger R
(Fig. 3.2b), the maximum occurs at the value of p for which R = dE,(p, q)/ép.
Since the second derivative is strictly negative, the first derivative is a decreasing
function of p, so that we can express the maximum of (3.2.6) for low rates as
E(R,q)=E,(1,q)-—R O<R<0OE,(p, q)/Op (3.2.7)
p=1
while for higher rates we must use the parametric equations
E(R, 4) = E,(p, 4) — peE,(p, 4)/6p (3.2.8)
R = 0E,(p, q)/Op
0E,(p, 4)/0p ae R < 0E,(p, q)/0p
p
= I(q)
p=0
For this higher-rate region, the slope is obtained as the ratio of partial derivatives
dE(R, q) _ OLE.(p, 4) — poE.(p, q)/Op|/ep
dR OR/dp
=-9 (3.2.9)
and the second derivative is
d°E(R,q) _ O[dE(R, 4)/4R]/op
dhe: OR/ép
: ot
— 0E,(p, q)/6p?
>0 (3.2.10)
Hence, while E(R, q) for low rates is linear in R with slope equal to — 1, for higher
rates it is monotonically decreasing and convex vu. Its slope, which is —p
(0 < p < 1), increases from —1 at R = 6E,(p, q)/dp |,-1, where it equals the slope
of the low-rate linear segment, to 0 at R = I(q) = 0E,(p, q)/0p|,-0, where the
function E(R, q) itself goes to zero. A typical E(R, q) function demonstrating these
properties is shown in Fig. 3.3a.
For the special class of channels for which condition (3.2.5b) holds so that the
second derivative of E,(p, q) is everywhere zero, we have
E(R, q) = ural (q) — R]
=I(q)-R O<R<I(q) (3.2.11)
Thus, as shown in Fig. 3.3b, the curved portion of the typical E(R, q) function
disappears and only the linear segment remains.
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 137
E(R, q)
Eo (1, q) E(R, q)
I(q)
|
|
|
|
|
|
R
0 dE (e, q)/dpl,_, I(q) 0 I(q)
(a) 8? Eo(p, q)/p” <0 (b) 8° Eo(p, q)/dp? = 0
Figure 3.3 Examples of E(R, q) function.
The negative exponent E(R) of (3.1.20) is, of course, obtained from E(R, q) by
maximizing over all possible distribution vectors. That is
E(R) = max E(R, q) (3.2.12)
q
q = (q(x): x € 2}
where
with the properties
qix)>0 = forallxe®
: Y a(x) = 1
Note that, as a consequence of these distribution constraints, the space of allowed
q is a closed convex region. For certain channels, including some of greatest
physical interest (see Sec. 3.4), a unique distribution vector q maximizes E(R) for
all rates; for other channels (Prob. 3.3c), two or more distributions maximize E(R)
over disjoint intervals of R; for still other channels (Prob. 3.5), the maximizing
distribution varies continuously with R. Regardless of which of the above situa-
tions holds, we have shown that, as a consequence of Gallager’s lemma, E(R, q) is
a bounded, decreasing, convex U, positive function of R for all rates R,O < R <
I(q). E(R) as defined by (3.2.12) is then the upper envelope of the set of all
functions E(R, q) over the space of probability distributions q. It is easily shown
that the upper envelope of a set of bounded, decreasing, convex U, positive func-
tions of R is itself a bounded, decreasing, convex vu, positive function of R. Thus,
for all rates R
0<R<C=max I(q)
q
= max % y q(x)p(y |x) In , rare (3.2.13)
138 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
E(R) is a bounded, decreasing, convex U, positive function of R. C is called the
channel capacity. This then proves the celebrated channel coding theorem.
Theorem 3.2.1 (Shannon [1948] et al.°) For any discrete-input memoryless
channel, there exists an N-symbol code (signal set) of rate R nats per symbol
for which the error probability with maximum likelihood decoding is
bounded by
P,, < e NER) (3.2.14)
where E(R), as defined by (3.1.20) and (3.1.18), is a convex U, decreasing,
positive function of R for 0 < R < C, where C is defined by (3.2.13).
Channel capacity was first defined in conjunction with average mutual infor-
mation in Sec. 1.2. Like the latter, it emerges here naturally as a fundamental
parameter of the error bounds—namely, the rate above which the exponential
bound is no longer valid. Its significance is increased further by the converse
theorem of Sec. 1.3, as well as by that to be proved in Sec. 3.9.
In spite of its unquestionable significance, this coding theorem leaves us with
two sources of uneasiness. The first disturbing thought is that, while there exists a
signal set or code whose error probability P,, averaged over all transmitted
messages, is bounded by (3.2.14), the message error probability P,; for some
message or signal vector x,,, may be much greater than the bound. While this may
indeed be true for some codes, we now show that there always exists a signal set or
code in the ensemble for which P,,, is within a factor of 4 of the coding theorem
bound for every m.
Corollary For any discrete-input memoryless channel, there exists an N-
symbol code of rate R for which maximum likelihood decoding yields
Py She NEU RIES Gy SPO pp! 9215)
ProoF The proof involves applying the channel coding theorem to the en-
semble of codes of the same dimensionality but with twice as many messages.
Let us assume further, arbitrarily, that the 2M messages are all a priori
equiprobable. Then from the above theorem, we have that there exists at least
one code in the ensemble of codes with 2M messages for which
1 2M
P,(2M) = — P
E( ) 2M oe sF Em
mae Mia ay ania his (3.2.16)
since the rate of this code is In (2M)/N. Now suppose we discard the M code
° Shannon actually proved that P, +0 as N > oo, while the exponential bound was proved in
various progressively more explicit forms by Feinstein [1954, 1955], Elias [1955], Wolfowitz [1957],
Fano [1961], and Gallager [1965].
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 139
(signal) vectors with highest P,;, . This guarantees that the remaining M code
vectors have
Pe en (3.2.17)
for, if this were not so, just the average of the M code vectors with largest
error probabilities would exceed the bound (3.2.16). Substituting (3.1.20) for
the exponent, we have for rate In (2M)/N
Pr, <2 exp {—N max max [E,(p, q) — p(In M)/N — p(In 2)/N]}
evap ~ 4
<2 exp {—N max max [E,(p, q) — p(In M)/N — (In 2)/N]}
q O<p<i
= e-NER) (3.2.18)
for each of the M code vectors. (Note that, while the above development was
for the code set of 2M messages and the corresponding maximum likelihood
decision regions, reducing the set to M messages can only reduce P,, by
expanding each decision region.) This proves the corollary.
The second disturbing thought is that, even though a code exists with low
error probability, it may be difficult, if not nearly impossible, to find. We may
dispel this doubt quickly for ensembles where uniform weighting (that is,
q(x) = 1/Q for all x € & = {a,, az, ..., ag}) is optimum. For in this case at least
half the codes in the ensemble must have
PLSCOP:
<2¢ A)
for again, if this were not so, the ensemble average could not be bounded by P;, as
given by (3.1.20). For nonuniformly weighted ensembles, the argument must in-
clude the effect of weighting and reduces essentially to a probabilistic statement.
In any case, the practical problem is not solely one of finding codes that yield low
P,, but codes which are easily generated and especially which are easily decoded,
that yield low P,. This will be the problem addressed throughout this book.
Before leaving the coding theorem, we dwell a little further on the problem of
finding the weighting distribution q which maximizes the negative exponent of the
bound at each rate. To approach this analytically, it is most convenient to rewrite
the exponent as
E(R)= max [pr + max E,(p, a| (3.2.19)
O<p<i1 q
Thus we need only maximize E,(p, q) or, equivalently, minimize with respect to q
e Bole) — ( q(x)p(y | x)" alr
y
=¥ aly, q)'* (3.2.20)
140 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
where
a(y, q) = >, a(x)p(y|x)"O*” (3.2.21)
x
The key to this minimization lies in the following lemma.
Lemma 3.2.2 The quantity exp [—E,(p, q)] is a convex U function on the
space of probability distributions q for all fixed p > 0.
PRrooF The convexity follows from the fact that «(y, q) is linear in q, while
the function «'* is convex u in « for all p > 0. Thus, by the definition of
convexity (App. 1A), for every y € Y, a(y, q)'*? must be a convex u function
of q. Finally, the sum of convex U functions must also be convex U; hence the
lemma is proved.
Also of interest, although it may be regarded as a byproduct of the maximiza-
tion of the exponent, is the problem of maximizing I(q) to obtain the channel
capacity C. It turns out that this problem is almost equivalent to minimizing
exp [—E,(p, q)] because I(q) has similar properties, summarized in the following.
Lemma 3.2.3 I(q) is a convex ~ function on the space of probability distribu-
tions q.
PRooF We begin the proof by rewriting the definition (3.2.2) of I(q) as
= 2 q(x) >) P(y|x) In p(y|x)
i/y abs)olr|»)
The first term is linear in q; hence it is trivially convex. The second term can
be written as
(3.2.22)
+ d y q(x)p(y |x) In
2 Bly) In [1/B(y)]
where
y= Dale) p(y |x)
is linear in q(x). But d?[ In (1/B)]/dB? = —1/B <0 since B>0. Hence
B In (1/B) is convex - in B, and f is linear in q. Thus by the same argument as
for the previous lemma, f(y) is convex ~ in q; and I(q), which is the sum
(finite, infinite, or even uncountably infinite) of convex 7 functions, is itself
convex o, thus proving the lemma.
The minimization (maximization) of convex U (4) functions over a space of
distributions is treated in App. 3B where necessary and sufficient conditions, due
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 141
to Kuhn and Tucker [1951], are derived for the minimum (maximum). For appli-
cation to the problems at hand, the following theorems are proved there.
Theorem 3.2.2: Exponents Necessary and sufficient conditions on the dis-
tribution vector q which minimizes exp [—E,(p, q)] [or, equivalently, maxi-
mizes E,(p, q)], for p > 0, are
Y p(y|x)O aly, ql? = Yaly,q)'*? = for all x e X = {ay, az, ..., ag}
y
(3.2.23)
where «(y, q) is defined by (3.2.21), with equality for all x for which q(x) > 0.
Theorem 3.2.3: Average mutual information Necessary and sufficient condi-
tions on the distribution vector q which maximizes [(q), to yield C, are
p(y |x)
~1<C forallx e # = {a,, a2,..., a
dP (y|x) “|X ae p(y |x’) {a, 2 o}
(3.2.24)
with equality for all x for which q(x) > 0.
The above theorems do not give explicit formulas for min, exp [—E,(p, q)]
and C. However (3.2.23) and (3.2.24) do serve the purpose of verifying or disprov-
ing an intuitive guess for the optimizing distribution. As a very simple example,
for a binary input (Q = 2) output-symmetric channel as defined in Sec. 2.9, these
necessary and sufficient conditions verify the intuitive fact that the optimizing
distribution in each case is uniform [g(a,) = q(az) = 4]. In the special case where
the output space as well as the input space is {a,, a2, ..., dg} and where the Q by Q
transition matrix {p(y|x)} is nonsingular, it can be shown (Prob. 3.4) that the
conditions of both theorems are easily satisfied with the inequalities all holding
with equality, and explicit formulas may be obtained both for the optimizing q
and for the quantities to be optimized. In general, however, the maximization
(minimization) must be performed numerically. This is greatly facilitated by the
fact that the functions are convex, which guarantees convergence to a maximum
(minimum) for any ofa class of steepest ascent (descent) algorithms. Appendix 3C
presents an efficient computational algorithm for determining channel capacity.
Similar algorithms for computing E(R) have been found by Arimoto [1976] and
Lesh [1976].
Even when the optimum q is known and is the same for all rates, the actual
computation of E,(p, q), E(R), and C is by no means simple in general. Usually the
simplest parameter to calculate is
E,(1, 4) = -n 5 |S ab x)./P(y |x)
Since E(R) is a decreasing function of rate, this provides a bound on the exponent.
Also, [E,(1, q) — R] as given by (3.2.7) is the low-rate exponent when maximized
2
(3.2.25)
142 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
over q. For any binary-input channel, this maximum is readily evaluated, since the
optimizing distribution is uniform, and yields
E,(1) = max E,(1, q)
q
= In2—In (1+ Z) (3.2.26)
where
Z => Vor) (3.2.27)
y
For the special cases of the BSC and AWGN channel,° Z is readily calculated to
be
Z = ./4p(1 — p) (BSC) (3.2.27b)
Zee tN (AWGN) (3.2.27c)
It follows upon applying the Schwarz inequality (App. 3A) to (3.2.27a) that
ae An ae &
Now suppose the rate is sufficiently low that the linear bound (3.2.7) is appro-
priate. Then, for a binary-input channel, we have
Py < e7 NEo(1)— RI
is Me Nila 2-In (1 + Z)] (3.2.28)
On the other hand, we showed in Sec. 2.9 that for linear binary block codes used
on this class of channels
M
Ppa rere (2.9.19)
k=2
where w,..., Wy are the weights of the nonzero code vectors and Z is given by
(3.2.27a). In particular, as was shown in Sec. 2.10, if we restrict M = N [so that
R = (In N)/N > 0as N > oo], then orthogonal codes exist for all N = 2* with the
property that w, = N/2 for all k # 1. In this case we have the bound
Pp <M e7Notind) (3.2.29)
Since in this latter case the rate approaches zero asymptotically with N, it is clear
that the bound (3.2.29) should be compared with the ensemble average upper
bound (3.2.28) as (In M)/N — 0 in both cases. It is easily shown that the negative
exponent of (3.2.29) dominates that of (3.2.28), that is
—tinZ >In? — in +2) Oecd (3.2.30)
with equality if and only if Z = 1.
The two exponents of (3.2.28) and (3.2.29) are shown in Fig. 3.4. Note from
© See (2.11.6) for the definition of po(y) for the AWGN channel.
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 143
In2-In(1+Z)
In 2
Z
0 ] Figure 3.4 Negative exponents of (3.2.28) and (3.2.29).
the examples (3.2.27b and c) that Z = 0 applies to a noiseless channel and that Z
grows monotonically with increasing noise, with the channel becoming useless in
the limit of Z = 1. We note also from the figure that the curves diverge as Z —> 0,
while they have the same negative slope at unity.
Now it may not be surprising that a particularly good code (e.g., the orthog-
onal code) is far better than the ensemble average which includes the effect of
some exceedingly bad codes having two or more code vectors which are identical.
In fact, however, this discrepancy occurs only at low rates; we shall show in
Sec. 3.6 that, for R > 0E,(p, q)/ép |, -; [that is, over the curved portion of the E(R)
function], the best code can perform no better asymptotically than the ensemble
average. Nevertheless, if at very low rates certain bad codes can cause such a
dramatic difference between the ensemble average and the best code, it stands to
reason that the ensemble average as such is not a useful bound at these rates.
While this might lead the more skeptical to discard the averaging technique at this
point, we shall in fact see that, with cleverness, the technique may be modified in
such a way as to eliminate the culprits. This modification, called an expurgated
ensemble average bound, is treated in the next section and shown for the special
case of binary-input, output-symmetric channels to yield the exponent of (3.2.29)
at asymptotically low rates.
3.3 EXPURGATED ENSEMBLE AVERAGE ERROR
PROBABILITY: UPPER BOUND AT LOW RATES
The approach to improving the bound at low rates is to consider a larger en-
semble of codes (or signal sets) with the same dimensionality N but having twice
as many code vectors, 2M. If our conjecture in the last section was correct, then
the error probability for most codes can be improved considerably by eliminating
the particularly bad code vectors. Thus we shall resort to the expurgation of the
worst half of the code vectors of some appropriate code of the ensemble. The
result will be a code with M code vectors of dimensionality N whose average
error probability can be shown to be much smaller at low rates than the upper
bound given in the channel coding theorem (Theorem 3.2.1).
144 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
In developing this expurgated bound, it is convenient to work not with the
ensemble average P,,, , but rather with the ensemble average of a fractional power
Py, where 0 < s < 1. We obtain such a bound in the form of
Lemma 3.3.1 For the ensemble of codes, defined by a distribution qy(x) on
%y with M vectors of N symbols used on a discrete-input channel, the en-
semble average of the sth power of the error probability for the mth message,
when maximum likelihood decoding is used, is bounded by
Pee Bo? eee m=1,2,...,M
where
B=M SY an(x)av(x)|E Pv PwWTX)| == G.3.1)
Consequently, the sum of these averages over all messages is bounded by
~
SPs, < MB (3.3.2)
Proor The derivation of (3.3.1) is along the lines of Sec. 3.1. First of all, since
we shall be interested principally in low rates, we use only the union-
Bhattacharyya bound, which coincides with the more elaborate Gallager
bound at p = 1. Hence from oe we have
Py (Xi, Xz; ---. Xe) = E Vont0TR ) dv VPw (Y |X)
m' =m
¥ bs (E Venton) )Pv(¥ [Xm )) (3.3.3)
|m’ =m
But if we restrict s to lie in the unit interval, we may use the inequality
be
which follows from the Hélder inequality (App. os to obtain
PE (xy Y |S VowtyTx )Pwl¥ |X’)
<4,
a Cae ee (3.3.4)
O<s<1 (3.3.5)
Now taking the ensemble average as in Sec. 3.1, we obtain
Pi, a a ~? Qn (X1) °** Qu (Xe )PE,(X1 °** Xe)
= ie ae Qn(X1) *** Qn(Xmz)
aon x1
=) ys >. Qn (X)qn(x’)
XK
» V Pwl¥ |Xm)Pw¥ | Xm)
» V Pwly|x)pwCy | x) (3.3.6)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 145
where the last step follows from the facts that each vector x,, is summed over
its entire space and )’,, dy(X,) = 1. From this, (3.3.1) follows trivially, as does
(3.3.2) since the former is a uniform bound for all messages; hence the lemma.
We now proceed to the key result which will induce us to expurgate half the
code vectors of a particular code with 2M code vectors to obtain a much better
code for M messages.
Lemma 3.3.2 At least one code, in the ensemble of all codes with 2M vectors
of N symbols, has
PE eR ORs Xt (3.3.7)
for at least M of its code vectors {x,,}, where B is given by Lemma 3.3.1 with
M = 2M.
Proor The proof is by contradiction and follows easily from Lemma 3.3.1.
Suppose Lemma 3.3.2 were not true. Then every code in the ensemble would
have
Pk, (X1, ---» Xam) = 2B (3.3.8)
for at least M + 1 of its code vectors {x,}. But then the sum over all code
vectors of the ensemble average of the sth moment would be lower bounded
du Pi = i iz Fe > Nn (x, )an is Gn(X2m)P¢,,(X1X2 oS as)
m=1 m=1x x X2M
2M
ae zy a qn (x, )an (x2) - = Gn(X2a) Ms Px, (X 1X2 aes Xom)
m=1
=2 - w(X1)Gn(X2) *** Gy (X2m¢)2B(M + 1)
where the last step follows from the facts that at least (M + 1) terms of the
(2M)-term sum are lower bounded by (3.3.8), that the remainder of the terms
are nonnegative, and that the weighting distribution factors qy(x,), qn(x2),
--,4n(X2m) are nonnegative. Finally, since the sum of the distribution over the
entire Q?“™ terms of the ensemble is unity, we have
which is in direct contradiction to (3.3.2) of Lemma 3.3.1 for M = 2M. The
lemma is thus proved by contradiction.
On the basis of this result, we note that if we expurgate (eliminate) the M code
vectors with the highest error probability P, from the code which satisfies (3.3.7)
146 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
of Lemma 3.3.2, we are left with a code of only M code vectors of dimension N
such that
Pek; Rees San) se ED ee eee SS t -(5/3.9)
where we denote the unexpurgated code vectors by {X,,}. However, in justifying
(3.3.9), we must note that the unexpurgated code vectors will have their error
probabilities altered by removal of the expurgated code vectors, but this alteration
can only lower the P,, of the remaining vectors, since removal of some vectors
causes the optimum decision regions of the remainder to be expanded. Hence
(3.3.9) follows. In combining (3.3.9) with (3.3.1) to obtain the explicit form of the
expurgated bound, if we make the further substitution p = 1/s, where 1 < p < «,
we obtain a result whose form bears a striking similarity to the form of the coding
theorem of Sec. 3.2. For we then have that for every message m of the expurgated
code,
Pe, < (AMY{S S alodantx]¥. P00 Rate RY | 1<p<x
(3.3.10)
If we finally impose the memoryless condition (3.1.12) on the channel,
N
Pv(y|x)= [] plas)
and similarly take the distribution to be a product measure,
N
= [] a
then we obtain by the identical set of steps used in deriving (3.1.14)
1/p\Np
Pe, < 4MYIY Yasar) Vly |x)P (y|x’) | l<p<o
(3.3.11)
To obtain the tightest bound, we must minimize with respect to the distribution
q(x) and the parameter p > 1. The result can be expressed in exponential form as
Theorem 3.3.1: Expurgated coding theorem (Gallager [1965]) For a discrete-
input memoryless channel there exists at least one code of M code vectors
of dimension N for which the error probability of each message, when
maximum likelihood decoding is used, is bounded by
Pr < e” NEex(R) oe ae ae ag Gee | (3.3.12)
where
E.,(R) = max sup (3.3.13)
q p2i
DY a(x)at
In 4
cinsi= = ["| oa
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 147
Note that, since the region of p is semi-infinite, the “maximum” over p
becomes a supremum. One slight inconvenience in the form of this theorem is the
appearance of the nuisance term (In 4)/N added to the rate. Of course, this term
is negligible for most cases of interest. In fact, this term can be made to disappear
by an alternative proof (Prob. 3.21); hence we shall ignore it henceforth. In
order to assess the significance of this result and the range of rates for which it is
useful, we need to establish a few properties of E,(p, q), which are somewhat
analogous to those of E,(p, q) discussed in Sec. 3.2. These are summarized as
Theorem 3.3.2 For any discrete-input memoryless channel for which I(q) + 0,
for all finite p > 1
E,(1, q) = E,(1, q) (3.3.15)
E,(p, q) > 0 (3.3.16)
0E,(p, q)
Pierce 0 (3.3.17)
0°E,(p, q)
Sa cies 0 (3.3.18)
with equality in (3.3.18) if and only if, for every pair of distinct inputs x and x’
for which q(x) > 0 and q(x’) > 0, p(y|x)p(y|x’) = 0 for all y. [If so, E,(p, q)
is just a constant multiple of p; such a channel is said to be noiseless.|
The equality (3.3.15) follows directly from (3.3.14) and (3.2.1) since
La)= —In dd aledale') & P|) |)
=-In) [Eats X)\/ P P(y|x) | = 2.0, q)
The remainder of the theorem, consisting of inequalities whose form is identical to
those for E,(p, q), is proved in App. 3A.
We note also that E,(p, q) and E,(p, q) are of interest for their corresponding
bounds over disjoint intervals of the real line except for the common point p = 1,
where the functions are equal. Figure 3.5 shows a composite graph of the
Figure 3.5 E,(p, q) and E,(p, q) for typical channel.
148 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
two functions for a typical channel. It can, in fact, be shown (Prob. 3.10) that
0E,(p, q)/Op |,=1 < OE,(p, q)/Op |,=, unless the channel is useless [that is, C =
max, I(q) = 0]. The maximization of (3.3.13) with respect to p is quite similar to
that of the ensemble average error exponent in Sec. 3.2. Letting
E.,(R, 4) = ap [E.(e, 4) — pR] (3.3.19)
and taking the channel to be other than noiseless or useless, we have from (3.3.18)
that E,(p, q) is strictly convex - in p, so that the supremum occurs at
0p 0p
Then in this region the negative exponent of (3.3.12) for a fixed q is given by the
parametric equations
(3.3.20)
e=%
0E,(p, 4)
E.,(R, 4) = E;(p, 4) — p :
p
. OE 5) OE, 9
Ra CEP: @) lim 0E.(p, 4) < R < CE 4) (3.3.21)
dp fig ee CP ak
Also, by exactly the same manipulation as in (3.2.9) and (3.2.10), we have
dE..(R, q) d°E.,(R, 4)
ex east) ex ; 3
dR r 7 een a)
so that the exponent is convex U with negative slope p > 1 over the region given
in (3.3.20). Furthermore, it is tangent to the straight line
E,(1, q)— R=E,(1,q)—R
at the point
R = 0E,(p, q)/6p| << 0E,(p, 4)/0p
=1
p= ” ag
E (1, q)
R
I(q) Figure 3.6 Composite exponent function.
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 149
The composite of the E,,(R, q) and E(R, q) functions for a typical channel is as
shown in Fig. 3.6. For all physical channels, E,,(R) is bounded for all rates as
shown, for example, in Fig. 3.6. However, for certain nonphysical but interesting
channels (e.g., that in Fig. 3.7), E..(R) becomes unbounded for sufficiently small
but positive rates, and consequently the error probability is exactly zero for cer-
tain codes of finite length.
The low-rate behavior of both classes of channels can be determined by
examining E,(p, q) and its first derivative in the limit as p > oo. We note that
(3.3.21) holds only for rates greater than the limiting value of the derivative. This
value is readily determined from the definition (3.3.14), by use of L’H6pital’s rule,
to be
R,(o, q) = lim CE .(p, 4)
oder OP
= tim EA?)
= —In » » q(x)q(x')b(x, x’) (3.3.23)
where
1 if vty |x) (y|x’) #
0 otherwise
P(x, x’) vr
Thus
R(0,q)=0 if Y /POTX)POTX) £0 (3.3.24)
y
for all pairs of inputs x, x’ € ¥ while
R,(o,q)>0 ~~ if d J P(y|x)p(y |x’) = (33.25)
for some pair of inputs x, x’ € 2 such that q(x)q(x’) # 0. In the latter case, we note
also that, since according to (3.3.22) the slope of E.,(R, q) approaches — oo as
; a, & >) b,
! D
7
| ay by
—p
| 2 b
: D
l p zs
: a4 a sea b,
0 ind 2in 2-H)
(P A
Figure 3.7 Composite exponent function for channel ./.
150 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
p — oo, the function E,,(R, q) approaches infinity as R > R,(00, q) > 0 from the
right, and is thus infinite for all lower rates. An example of such an exponent and
the corresponding channel, which is the disjoint parallel combination of two
BSCs, is shown in Fig. 3.7 (Prob. 3.9). The intuitive explanation of this nonphysi-
cal result is that if a channel has two distinct inputs which cannot both reach the
same output for any particular output, then the exclusive use of these two input
symbols will result in error-free communication even without coding.
Returning to the physical situation where R,,(0o, q) = 0, it is of interest also to
determine the value of the zero-rate expurgated exponent. Here, according to
(3.3.17) and (3.3.19), and letting s = 1/p, we have
E.,(0, q) = sup E,(p, q)
p21
= lim E,(p, q)
pa
1
= tim = Lin SE abode) |S VeOTOT=)] 3.26)
s70 ee
Finally, using L’Hopital’s rule, we have
“2 2 ae) *’) In), / Ply |x)P (y |x’)
The optimization ik respect to q is exceedingly difficult because E,(p, q) is
not convex over q and can have several local maxima.’ Little is known about
the optimum weighting distribution except for a special class of channels
(Prob. 3.11) and for binary-input channels. In the latter case, it follows easily from
(3.3.14) that
(or) + atlas) + 20H » Vowrayora)| |
= ~p In [1 — 2q(a,)q(a,)(1 — Z"”)] (3.3.28)
E,,(0) = max (3.3.27)
—p lin
E,(p, q)
where
L= 2 VPy|ai)p (y | a2)
It follows trivially that this is maximized for all p by the vector q = (5, 4), and that
1 1/p
max E,(p, q) ie In (|
q
(3.3.29)
7 Furthermore, memoryless channels exist for which the product measure
as) = Tats)
does not optimize the expurgated exponent (Jelinek [1968b]).
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 151
Thus, for all binary-input memoryless channels
~pin(** 2°") - pr| (3.3.30)
E.,(R) = sup 5
p21
Furthermore as R > 0, we have, from (3.3.27) with q = (4, 3), that
E,, (0) = —4 In Z (3.3.31)
Note that this is the same as the exponent of (3.2.29) for low-rate orthogonal
signals (Z being the same in both cases), which being considerably greater than
the ensemble average bound exponent (Fig. 3.4) originally prompted our further
investigation of low-rate exponents. Also noteworthy are the facts that the present
results are for binary-input channels which need not be output-symmetric and,
somewhat surprisingly, that the uniform input weighting distribution is optimum
for all these channels.
3.4 EXAMPLES: BINARY-INPUT, OUTPUT-SYMMETRIC
CHANNELS AND VERY NOISY CHANNELS
The computation of exponential bounds for explicit channels is generally very
involved. Except for certain contrived (generally nonphysical) examples
(Probs. 3.2, 3.5) and some limiting cases, explicit formulas are not available. Even
for the particularly simple, often studied, binary-symmetric channel, the high-rate
exponents of both the ensemble average and the expurgated ensemble average
bounds can only be obtained in parametric form. These are nevertheless valuable
for later comparison.
For the BSC with crossover probability p < 1/2, beginning with the ensemble
average bound of the coding theorem (Sec. 3.2), we have from (3.2.1)

    max_q E_o(ρ, q) = ρ ln 2 − (1 + ρ) ln [p^{1/(1+ρ)} + (1 − p)^{1/(1+ρ)}]               (3.4.1)

since the maximizing distribution for this completely symmetrical channel is
always the uniform distribution. Upon substituting in (3.2.7) and (3.2.8), after
considerable calculation of derivatives and manipulations, letting

    ℋ(x) = −x ln x − (1 − x) ln (1 − x)                                                    (3.4.2)

    𝒯_x(y) = −y ln x − (1 − y) ln (1 − x)                                                  (3.4.3)

which is the line tangent to ℋ(x) at y = x, and letting

    p_ρ = p^{1/(1+ρ)} / [p^{1/(1+ρ)} + (1 − p)^{1/(1+ρ)}]                                  (3.4.4)

we find for low rates

    E(R) = ln 2 − 2 ln (√p + √(1 − p)) − R        0 ≤ R ≤ ln 2 − ℋ(√p/(√p + √(1 − p)))     (3.4.5)
and we find for high rates the parametric equations

    E(R) = 𝒯_p(p_ρ) − ℋ(p_ρ)
    R = ln 2 − ℋ(p_ρ)
    ln 2 − ℋ(√p/(√p + √(1 − p))) ≤ R ≤ ln 2 − ℋ(p)                                        (3.4.6)
As for the expurgated ensemble average error bound (Sec. 3.3), for the class of
binary-input channels⁸ (which includes the BSC as the simplest case), we have
from (3.3.29)

    max_q E_x(ρ, q) = −ρ ln [(1 + Z^{1/ρ})/2]                                              (3.4.7)

and consequently, maximizing (3.3.30), we obtain after some manipulation the
parametric equations

    E_ex(R) = −δ ln Z
    R = ln 2 − ℋ(δ)                                                                        (3.4.8)

where

    δ = Z^{1/ρ}/(1 + Z^{1/ρ})        ρ ≥ 1                                                 (3.4.9)
For the BSC, as was shown in (3.2.27b), Z = √(4p(1 − p)).
The exponent of the exponential upper bounds for any channel is characterized
mainly by three parameters:

1. E(0) = max_q E_o(1, q), the zero-rate ensemble average exponent
2. E_ex(0) = lim_{ρ→∞} max_q E_x(ρ, q), the zero-rate expurgated ensemble average exponent
3. C = max_q I(q) = lim_{ρ→0} max_q ∂E_o(ρ, q)/∂ρ, the channel capacity
These are important for two reasons. First, as can be seen in Fig. 3.6, the latter two
represent the E-axis and R-axis intercepts of the best upper bounds found, while
E(0) − R is the "support" line of the bound to which the low-rate and high-rate
bounds are both tangent.⁹ More important, as we shall find in the next two
sections, both E_ex(0) and C are similar parameters of the exponent of the lower

⁸ Recall that output symmetry is not required for the expurgated bound, because the optimizing
distribution is (½, ½) for any binary-input channel; this is not the case for the ensemble average
bound, however.
⁹ For convolutional codes, as we shall discover in Chaps. 5 and 6, E(0) is the most important
parameter, especially in connection with sequential decoding.
bound on error probability of the best code, and at least one point of the line
E(0) − R also lies on the exponent curve for the lower bound.
It is particularly instructive to examine these three parameters for the subclass
of binary-input, output-symmetric channels, which includes the binary-input
AWGN channel and its symmetrically quantized reductions, the simplest of which
is the BSC, as originally described in Sec. 2.8. For this class, all parameters are
optimized by the uniform distribution q = (½, ½). The first two parameters are
easily expressed in terms of the generic parameter Z [see (3.2.26) and (3.3.31)] as
    E(0) = ln 2 − ln (1 + Z)                                                               (3.4.10)

    E_ex(0) = −½ ln Z                                                                      (3.4.11)

where

    Z = Σ_y √(p₀(y)p₁(y))                                                                  (3.4.12)

Capacity is more difficult to calculate but is readily expressed, upon using the fact
that p₁(y) = p₀(−y) in (3.2.13), as

    C = Σ_y p₀(y) ln p₀(y) − Σ_y p(y) ln p(y)                                              (3.4.13)

where

    p(y) = [p₀(y) + p₀(−y)]/2                                                              (3.4.14)
For the AWGN channel, the first two parameters are characterized by

    Z = e^{−ℰ_s/N_o}        (AWGN)                                                         (3.4.15)

as was first established in (3.2.27c), and the capacity is

    C = −½ ln 2πe − ∫_{−∞}^{∞} p(y) ln p(y) dy        (AWGN)

where¹⁰

    p₀(y) = (1/√(2π)) exp [−(y − √(2ℰ_s/N_o))²/2]                                          (3.4.16)

and p(y) is given by (3.4.14).
For the BSC considered as a two-level quantized reduction of the AWGN
channel, we have from (3.2.27b) and (3.4.13) or (3.4.6)

    Z = √(4p(1 − p))        (BSC)                                                          (3.4.17)

    C = ln 2 − ℋ(p)         (BSC)                                                          (3.4.18)

where

    p = Q(√(2ℰ_s/N_o))

¹⁰ See Eq. (2.11.6).
Intermediate cases of soft quantization require calculation of p₀(y) for y ∈ {b₁, b₂,
..., b_J} as a function of ℰ_s/N_o. With symmetric octal quantization, these are
determined in (2.8.1). Calculation of Z and C using (3.4.12) and (3.4.13) is straightforward
but tedious, and can best be handled numerically. The results for the
AWGN channel, BSC, and the binary-input octal-output-quantized channel are
shown in Fig. 3.8, where all three parameters, (3.4.10), (3.4.11), and (3.4.13),
normalized by ℰ_s/N_o, are plotted as functions of ℰ_s/N_o.
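The numerical evaluation just described can be illustrated for the hard-quantized (BSC) reduction, where (3.4.17) and (3.4.18) give Z and C in closed form. The following minimal Python sketch is ours, not the text's: it assumes SciPy's normal tail function for Q(·), and it tabulates the three parameters over a range of ℰ_s/N_o so that the small-ℰ_s/N_o limits (3.4.20) and (3.4.22) below become visible.

    import numpy as np
    from scipy.stats import norm

    def bsc_parameters(es_over_n0):
        """Hard-quantized (BSC) reduction of the binary-input AWGN channel."""
        p = norm.sf(np.sqrt(2.0 * es_over_n0))    # p = Q(sqrt(2*Es/N0))
        Z = np.sqrt(4.0 * p * (1.0 - p))          # (3.4.17)
        H = -p * np.log(p) - (1 - p) * np.log(1 - p)
        C = np.log(2.0) - H                       # (3.4.18), nats
        E0 = np.log(2.0) - np.log(1.0 + Z)        # (3.4.10)
        Eex0 = -0.5 * np.log(Z)                   # (3.4.11)
        return p, Z, C, E0, Eex0

    for snr_db in (-10.0, -5.0, 0.0, 5.0):
        es_n0 = 10.0 ** (snr_db / 10.0)
        p, Z, C, E0, Eex0 = bsc_parameters(es_n0)
        # As Es/N0 -> 0, C/(Es/N0) -> 2/pi and Eex0 ~ E0 ~ C/2, cf. (3.4.20) and (3.4.22)
        print(snr_db, "dB:  C/(Es/N0) =", C / es_n0, "  E(0)/C =", E0 / C, "  E_ex(0)/C =", Eex0 / C)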
Most noteworthy is the behavior as ℰ_s/N_o → 0. It appears from the figure that
for the AWGN channel

    C ≈ ℰ_s/N_o        for ℰ_s/N_o ≪ 1        (AWGN)                                       (3.4.19)
Figure 3.8 Exponents and capacity for binary-input, symmetrically quantized-output AWGN channels
(all parameters normalized by ℰ_s/N_o and plotted versus ℰ_s/N_o in decibels). J = 2: hard quantization;
J = 8: octal; J = ∞: unquantized.
while for the BSC

    C ≈ (2/π)(ℰ_s/N_o)        for ℰ_s/N_o ≪ 1        (BSC)                                 (3.4.20)

For the octal channel with uniform quantization and a = ½√(N_o/2) (see Fig. 2.13)

    C ≈ 0.95 ℰ_s/N_o        for ℰ_s/N_o ≪ 1        (octal)                                 (3.4.21)
Also of interest is the fact that for all these channels

    E_ex(0) ≈ E(0) ≈ C/2        for ℰ_s/N_o ≪ 1                                            (3.4.22)

Thus, as the symbol energy-to-noise density ratio becomes very small, it appears that
the expurgated ensemble bound blends into the ensemble bound and both have an
E-axis intercept at C/2; hard quantization causes a loss of a factor of 2/π in all
parameters, and soft (octal) quantization causes a negligible loss relative to
unquantized decoding.
The asymptotic relations (3.4.19), (3.4.20), and (3.4.22) can be easily shown
analytically (Prob. 3.12). In each case, letting ℰ_s/N_o → 0 results in a channel which
is an example of a very noisy channel. This class of channels is characterized by the
property

    p(y|x) = p(y)[1 + ε(x, y)]        all x, y                                             (3.4.23)

where

    |ε(x, y)| ≪ 1

and

    p(y) = Σ_x q(x)p(y|x)                                                                  (3.4.24)
Since q(x) is the input weighting distribution used in all bounds, it follows that
p(y) is also a distribution, sometimes called the output distribution.¹¹ Hence

    1 = Σ_y p(y|x) = Σ_y p(y)[1 + ε(x, y)] = 1 + Σ_y p(y)ε(x, y)                           (3.4.25)

and

    p(y) = Σ_x q(x)p(y|x) = p(y)[1 + Σ_x q(x)ε(x, y)]                                      (3.4.26)

¹¹ Note that p(y) is the actual output distribution when the input distribution is q(x); however, the
weighting distribution q(x) is only an artifice used to define an ensemble of codes; it says nothing
about the actual input distribution when a particular code is used on the channel.
From (3.4.25) we obtain

    Σ_y p(y)ε(x, y) = 0        for all x ∈ 𝒳                                               (3.4.27)

and from (3.4.26)

    Σ_x q(x)ε(x, y) = 0        for all y for which p(y) > 0                                (3.4.28)

Since the optimizing input distribution is q(x) = ½ for both inputs, it is easy to
verify that, for a BSC with p = ½(1 − ε) where |ε| ≪ 1, p(y) = ½ for both outputs and
(3.4.23) holds with

    ε(x, y) = { +ε   for x = y
                −ε   for x ≠ y
A similar but more elaborate argument (Prob. 3.13) shows that, for ℰ_s/N_o ≪ 1, the
unquantized AWGN channel satisfies the definition (3.4.23) of very noisy channels,
as one would expect. Now using the definition (3.4.23) and the resulting properties
(3.4.27) and (3.4.28), we obtain for the basic function of the ensemble bound (3.2.1)

    E_o(ρ, q) = −ln Σ_y {Σ_x q(x)p(y)^{1/(1+ρ)}[1 + ε(x, y)]^{1/(1+ρ)}}^{1+ρ}

Since |ε| ≪ 1, we may expand (1 + ε)^{1/(1+ρ)} in a Taylor series about ε = 0 and
drop all terms above quadratic powers. The result is

    E_o(ρ, q) ≈ −ln Σ_y p(y){Σ_x q(x)[1 + ε(x, y)/(1 + ρ) − ρε²(x, y)/(2(1 + ρ)²)]}^{1+ρ}
             ≈ −ln Σ_y p(y)[1 − ρ Σ_x q(x)ε²(x, y)/(2(1 + ρ))]

where the last step follows from (3.4.28). Expanding the result in a Taylor series
about ε = 0 and again dropping terms above quadratic, we obtain

    E_o(ρ, q) ≈ −ln [1 − (ρ/(2(1 + ρ))) Σ_y Σ_x q(x)p(y)ε²(x, y)]
             ≈ (ρ/(2(1 + ρ))) Σ_y Σ_x q(x)p(y)ε²(x, y)                                     (3.4.29)
But for the same class of channels, using the same operations, we obtain for
the channel capacity

    C = max_q I(q)
      = max_q Σ_x Σ_y q(x)p(y)[1 + ε(x, y)] ln {p(y)[1 + ε(x, y)]/p(y)}
      = max_q Σ_x Σ_y q(x)p(y)[1 + ε(x, y)][ε(x, y) − ε²(x, y)/2 + ···]
      ≈ max_q ½ Σ_x Σ_y q(x)p(y)ε²(x, y)                                                   (3.4.30)
Thus maximizing (3.4.29) over q and using (3.4.30), we obtain

    E_o(ρ) = max_q E_o(ρ, q) ≈ ρC/(1 + ρ)                                                  (3.4.31)

For this class of very noisy channels, the ensemble average error bound exponent
(3.1.20) thus becomes

    E(R) = max_{0≤ρ≤1} [max_q E_o(ρ, q) − ρR]
         ≈ max_{0≤ρ≤1} [ρC/(1 + ρ) − ρR]                                                   (3.4.32)
But this is identical to the problem of maximizing the negative exponent of (2.5.15)
required to obtain the tightest bound on orthogonal signal sets on the AWGN
channel. Thus we employ the same argument that led from (2.5.15) to (2.5.16) to
obtain

    E(R) ≈ { C/2 − R           0 ≤ R/C ≤ 1/4
             (√C − √R)²        1/4 ≤ R/C ≤ 1                                               (3.4.33)
which is the function shown in Fig. 2.7. We defer comment on this remarkable
equivalence until we have also evaluated the expurgated bound exponent. For the
class of very noisy channels, we have from (3.3.14)

    E_x(ρ, q) = −ρ ln Σ_x Σ_{x'} q(x)q(x')[Σ_y √(p(y|x)p(y|x'))]^{1/ρ}
              = −ρ ln Σ_x Σ_{x'} q(x)q(x'){Σ_y p(y)√([1 + ε(x, y)][1 + ε(x', y)])}^{1/ρ}
              ≈ −ρ ln Σ_x Σ_{x'} q(x)q(x'){1 − (1/8)Σ_y p(y)[ε(x, y) − ε(x', y)]²}^{1/ρ}
              ≈ (1/8) Σ_x Σ_{x'} Σ_y q(x)q(x')p(y)[ε(x, y) − ε(x', y)]²

Thus finally, using (3.4.28) and (3.4.30),

    max_q E_x(ρ, q) ≈ max_q (1/4) Σ_x Σ_y q(x)p(y)ε²(x, y)
                    = C/2                                                                  (3.4.34)
and from (3.3.19) we have that the expurgated bound exponent is

    E_ex(R) = sup_{ρ≥1} [max_q E_x(ρ, q) − ρR]
            ≈ C/2 − R                                                                      (3.4.35)
Since this coincides with the straight-line portion of the ensemble average bound
(3.4.33), it is clear that expurgation produces no improvement for a very noisy
channel. We note also that (3.4.33) and (3.4.35), evaluated at zero rate, confirm the
previous limiting result (3.4.22).
We turn now to showing that the coincidence of the results (3.4.33) for very
noisy channels with those of (2.5.16) for orthogonal signal sets on the AWGN
channel is not so surprising after all. For, while Sec. 2.5 dealt with arbitrary
orthogonal signal sets, we found in Sec. 2.10 that a binary orthogonal signal set
could be generated from an orthogonal linear code with the number of code
vectors equal to the number of symbols N. Now the symbol energy for this signal
set is ℰ_s = ℰ/N, where ℰ is the energy per signal. Thus no matter how large ℰ/N_o
may be, for large N, ℰ_s/N_o becomes arbitrarily small; hence the code is operating
over a very noisy channel. To complete the parallelism, we note from (2.5.13) and
(2.5.14) that

    TC_T/N = ℰ_s/N_o = C                                                                   (3.4.36)

while

    R = (ln M)/N = TR_T/N                                                                  (3.4.37)

Thus (2.5.16) may be rewritten using (3.4.36) and (3.4.37) as

    P_E ≤ e^{−TE_T(R_T)} = e^{−NE(R)}

where E(R) is given by (3.4.33).
This concludes our treatment of upper bounds on error probability of general
block codes. To assess their tightness and consequent usefulness, we must deter-
mine corresponding lower bounds on the best signal set (or code) for the given
channel and with the given parameters. In the next three sections, we shall
discover an amazing degree of similarity between such lower bounds and the
upper bounds we have already found, thus demonstrating the value of the latter.
3.5 CHERNOFF BOUNDS AND THE
NEYMAN-PEARSON LEMMA
All lower bounds on error probability depend essentially on the following
theorem which is a stronger version of the well-known Neyman-Pearson lemma
for binary hypothesis testing. After stating the theorem, we shall comment on its
uses and applications prior to proceeding with the proof.
Theorem 3.5.1 (Shannon, Gallager, and Berlekamp [1967]) Let p_N^{(a)}(y) and
p_N^{(b)}(y) be arbitrary probability distributions (density functions) on the N-dimensional
observation space 𝒴_N, and let 𝒴_a and 𝒴_b be any two disjoint
subspaces of 𝒴_N with 𝒴̄_a and 𝒴̄_b their respective complements. Let there be at
least one y ∈ 𝒴_N for which p_N^{(a)}(y)p_N^{(b)}(y) ≠ 0. Then for each s, 0 < s < 1, at least
one of the following pair of inequalities must hold

    P_a ≡ Σ_{y∈𝒴̄_a} p_N^{(a)}(y) > ¼ exp [μ(s) − sμ'(s) − s√(2μ″(s))]                    (3.5.1)

    P_b ≡ Σ_{y∈𝒴̄_b} p_N^{(b)}(y) > ¼ exp [μ(s) + (1 − s)μ'(s) − (1 − s)√(2μ″(s))]        (3.5.2)

where

    μ(s) = ln Σ_y p_N^{(a)}(y)^{1−s} p_N^{(b)}(y)^s                                        (3.5.3)

is a nonpositive convex ∪ function on the interval 0 < s < 1. Furthermore, for
the choice

    𝒴_a = {y: ln [p_N^{(b)}(y)/p_N^{(a)}(y)] < μ'(s)} = 𝒴̄_b                              (3.5.4)

both of the following upper bounds hold

    P_a ≤ e^{μ(s) − sμ'(s)}                                                                (3.5.5)

    P_b ≤ e^{μ(s) + (1 − s)μ'(s)}                                                          (3.5.6)
These latter two inequalities are known as Chernoff bounds. If we associate 𝒴_N
with the observation space for a two-message signal set, p_N^{(a)}(y) and p_N^{(b)}(y) with the
likelihood functions of the two signals, and 𝒴_a and 𝒴_b with the corresponding
decision regions, it follows that P_a and P_b are the error probabilities for the two
messages. Thus, this theorem is closely related to the Neyman-Pearson lemma
(Neyman and Pearson [1928]), as can best be demonstrated by inspecting the
graph of μ(s), a convex ∪ nonpositive function on the unit interval, shown for a
typical channel in Fig. 3.9. We note in particular that μ(0) = μ(1) = 0 if and only
if, for every y ∈ 𝒴_N, p_N^{(a)}(y)p_N^{(b)}(y) ≠ 0, a condition met by most practical
channels.¹² We note further that for memoryless channels, since

    p_N^{(m)}(y) = ∏_{n=1}^N p(y_n|x_{mn})        m = a or b

¹² However, the Z channel described in Probs. 2.10 and 3.17 does not meet this condition; it has
μ(1) = N ln p < 0.
Figure 3.9 Typical μ(s) and relation between exponents of P_a and P_b.
we have for the two code vectors x_a and x_b

    μ(s) = Σ_{n=1}^N ln Σ_{y_n} p(y_n|x_{an})^{1−s} p(y_n|x_{bn})^s                        (3.5.7)

which grows linearly with N. Consequently, as N → ∞, the square roots in (3.5.1)
and (3.5.2) become asymptotically negligible in comparison with the other terms
in the exponents. Thus, if we disregard these terms as well as the asymptotically
even less significant factors of ¼, we find that the alternative lower bounds
become identical to the upper bounds. Then it follows, as shown in Fig. 3.9, that
the line tangent to μ(s) at some point s_o will intercept the two vertical lines s = 0
and s = 1 at negative values of μ exactly equal to the two exponents. It also
follows from the statement of the theorem that, fixing the exponent (and hence the
asymptotic value) of P_b at [μ(s₁) + (1 − s₁)μ'(s₁)] where s₁ ∈ [0, 1], guarantees that
the exponent of P_a will be [μ(s₁) − s₁μ'(s₁)] and that no lower (more negative)
exponent is possible for P_a. A lower value for the exponent of P_a (or P_b) requires
repositioning of the tangent line on this functional "see-saw," with a resulting
increase in the value of the other exponent. Thus it should be apparent that the
theorem is essentially equivalent to the Neyman-Pearson lemma, although it
contains somewhat more detail than the conventional form. The parallel is complete
if we note that the subspaces, which make both upper bounds equal asymptotically
to the lower bounds and hence the best achievable, are given by (3.5.4).
But these correspond to the likelihood ratio rule, which is the optimum according
to the Neyman-Pearson lemma, with threshold μ'(s), which is the slope of the
tangent line in Fig. 3.9.
We note finally that, in the two-message case over an N-dimensional memoryless
channel, if we require P_a and P_b to be equal, then we must choose s such
that μ'(s) = 0 in (3.5.4). Then (3.5.5) and (3.5.6) give identical upper bounds, and
(3.5.1) and (3.5.2) give asymptotically equal lower bounds. We conclude that, if
s = s_o where μ'(s_o) = 0, then

    e^{μ(s_o) − No(N)} ≤ P_x ≤ e^{μ(s_o)}        for x = a or b                            (3.5.8)

where¹³ o(N) → 0 as N → ∞. With reference to Fig. 3.9, it is clear that this
corresponds to the case where the straight line is tangent to the minimum point
of μ(s).
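To make the "see-saw" of Fig. 3.9 concrete, the following minimal Python sketch (our own construction, assuming a BSC and two codewords at Hamming distance d) evaluates μ(s) of (3.5.7) on a grid, differentiates it numerically, and reads off the two Chernoff exponents of (3.5.5) and (3.5.6); at the minimum point s_o = ½ the common exponent is d ln Z, as used later in Sec. 3.7.

    import numpy as np

    def mu_bsc(s, p, d):
        """mu(s) of (3.5.7) for two BSC codewords at Hamming distance d.
        Agreeing positions contribute 0; each differing position contributes
        ln[(1-p)^(1-s) p^s + p^(1-s) (1-p)^s]."""
        return d * np.log((1 - p) ** (1 - s) * p ** s + p ** (1 - s) * (1 - p) ** s)

    p, d = 0.05, 20
    s = np.linspace(1e-3, 1 - 1e-3, 1999)
    mu = mu_bsc(s, p, d)
    mup = np.gradient(mu, s)                  # numerical mu'(s)
    exp_a = mu - s * mup                      # exponent of P_a, cf. (3.5.5)
    exp_b = mu + (1 - s) * mup                # exponent of P_b, cf. (3.5.6)
    j = np.searchsorted(s, 0.3)               # an off-center s: one exponent improves, the other worsens
    print("at s = 0.3: exponent of P_a =", exp_a[j], "  exponent of P_b =", exp_b[j])
    i0 = np.argmin(np.abs(mup))               # s_o with mu'(s_o) = 0 (here s_o = 1/2 by symmetry)
    print("s_o =", s[i0], "  mu(s_o) =", mu[i0], "  d*ln Z =", d * np.log(np.sqrt(4 * p * (1 - p))))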
Proof We now proceed to prove the theorem, beginning by twice differentiating
(3.5.3) to obtain

    μ'(s) = Σ_y p_N^{(a)}(y)^{1−s} p_N^{(b)}(y)^s ln [p_N^{(b)}(y)/p_N^{(a)}(y)] / Σ_{y'} p_N^{(a)}(y')^{1−s} p_N^{(b)}(y')^s        (3.5.9)

    μ″(s) = Σ_y p_N^{(a)}(y)^{1−s} p_N^{(b)}(y)^s {ln [p_N^{(b)}(y)/p_N^{(a)}(y)]}² / Σ_{y'} p_N^{(a)}(y')^{1−s} p_N^{(b)}(y')^s − [μ'(s)]²        (3.5.10)

Now we denote the log likelihood ratio by

    D(y) = ln [p_N^{(b)}(y)/p_N^{(a)}(y)]                                                  (3.5.11)

Also, in the interval 0 < s < 1, we define the "tilted" probability density

    Q_N^{(s)}(y) = p_N^{(a)}(y)^{1−s} p_N^{(b)}(y)^s / Σ_{y'} p_N^{(a)}(y')^{1−s} p_N^{(b)}(y')^s        0 < s < 1        (3.5.12)

As the tilting variable s approaches 0 and 1, Q_N^{(s)}(y) approaches p_N^{(a)}(y) and
p_N^{(b)}(y), respectively. Now if we take y to be a random vector with probability
(density) Q_N^{(s)}(y), it is clear from (3.5.9) through (3.5.12) that the random
variable D(y) has a mean equal to μ'(s) and a variance equal to μ″(s); consequently,
μ″(s) ≥ 0. Furthermore, it follows from (3.5.3) that μ(0) ≤ 0 and
μ(1) ≤ 0 with equality in either case if and only if, for every y ∈ 𝒴_N,
p_N^{(a)}(y)p_N^{(b)}(y) ≠ 0. Thus it follows that μ(s) is a nonpositive convex ∪ function in
this interval.
Comparing (3.5.3), (3.5.11), and (3.5.12), we see immediately that

    p_N^{(a)}(y) = e^{μ(s) − sD(y)} Q_N^{(s)}(y)                                           (3.5.13)

    p_N^{(b)}(y) = e^{μ(s) + (1 − s)D(y)} Q_N^{(s)}(y)                                     (3.5.14)

¹³ Here o(N) = 1/√N.

We can now establish the upper bounds (3.5.5) and (3.5.6). Let the decision
regions be chosen according to (3.5.4), corresponding to a likelihood ratio
decision rule with threshold μ'(s). Then, using (3.5.11), these may be expressed
as

    𝒴_a = {y: D(y) < μ'(s)} = 𝒴̄_b                                                        (3.5.15)

from which it follows that

    −sD(y) ≤ −sμ'(s)        for all y ∈ 𝒴̄_a, 0 < s < 1                                   (3.5.16)

and

    (1 − s)D(y) < (1 − s)μ'(s)        for all y ∈ 𝒴̄_b, 0 < s < 1                         (3.5.17)

Consequently, from (3.5.13) and (3.5.14), we have

    P_a = Σ_{y∈𝒴̄_a} p_N^{(a)}(y) ≤ exp [μ(s) − sμ'(s)] Σ_{y∈𝒴̄_a} Q_N^{(s)}(y)          (3.5.18)

    P_b = Σ_{y∈𝒴̄_b} p_N^{(b)}(y) ≤ exp [μ(s) + (1 − s)μ'(s)] Σ_{y∈𝒴̄_b} Q_N^{(s)}(y)    (3.5.19)

and, since Q_N^{(s)}(y) is a probability (density), the sums (integrals) in (3.5.18) and
(3.5.19) are bounded by unity, which yields (3.5.5) and (3.5.6).
We now prove the lower bounds of (3.5.1) and (3.5.2) for arbitrary disjoint
decision regions. We begin by defining the subspace

    𝒴_s = {y: |D(y) − μ'(s)| ≤ √(2μ″(s))}                                                 (3.5.20)

Then, recalling that μ'(s) and μ″(s) ≥ 0 are respectively the mean and variance
of D(y) with respect to the probability density Q_N^{(s)}(y), we see from the Chebyshev
inequality that

    Σ_{y∈𝒴̄_s} Q_N^{(s)}(y) = Pr {|D(y) − E_s[D(y)]| > √(2μ″(s))}
                            ≤ var_s [D(y)]/(2μ″(s)) = ½                                    (3.5.21)

where E_s[·] and var_s[·] indicate the mean and variance with respect to Q_N^{(s)}(·).
Thus

    Σ_{y∈𝒴_s} Q_N^{(s)}(y) ≥ ½

and we may lower bound P_a and P_b by summing over a smaller subspace in
each case, as follows.

    P_a = Σ_{y∈𝒴̄_a} p_N^{(a)}(y) ≥ Σ_{y∈𝒴̄_a∩𝒴_s} p_N^{(a)}(y)                          (3.5.22)

    P_b = Σ_{y∈𝒴̄_b} p_N^{(b)}(y) ≥ Σ_{y∈𝒴̄_b∩𝒴_s} p_N^{(b)}(y)                          (3.5.23)
But, for all y ∈ 𝒴_s, it follows from (3.5.20) that

    μ'(s) − √(2μ″(s)) ≤ D(y) ≤ μ'(s) + √(2μ″(s))                                           (3.5.24)

and consequently, from (3.5.13), (3.5.14), and (3.5.24), that for all y ∈ 𝒴_s

    p_N^{(a)}(y) ≥ exp [μ(s) − sμ'(s) − s√(2μ″(s))] Q_N^{(s)}(y)                           (3.5.25)

    p_N^{(b)}(y) ≥ exp [μ(s) + (1 − s)μ'(s) − (1 − s)√(2μ″(s))] Q_N^{(s)}(y)               (3.5.26)

Then since the regions of summation (integration) for the right sides of
(3.5.22) and (3.5.23) are subspaces of 𝒴_s, it follows that

    P_a ≥ exp [μ(s) − sμ'(s) − s√(2μ″(s))] Σ_{y∈𝒴̄_a∩𝒴_s} Q_N^{(s)}(y)                    (3.5.27)

    P_b ≥ exp [μ(s) + (1 − s)μ'(s) − (1 − s)√(2μ″(s))] Σ_{y∈𝒴̄_b∩𝒴_s} Q_N^{(s)}(y)        (3.5.28)

Finally, since 𝒴_a and 𝒴_b are disjoint, we have

    𝒴̄_a ∪ 𝒴̄_b = 𝒴_N

Hence, it follows from this and the consequence of (3.5.21) that

    Σ_{y∈𝒴̄_a∩𝒴_s} Q_N^{(s)}(y) + Σ_{y∈𝒴̄_b∩𝒴_s} Q_N^{(s)}(y) ≥ Σ_{y∈𝒴_s} Q_N^{(s)}(y) ≥ ½

Thus, at least one of the following pair of inequalities must hold

    Σ_{y∈𝒴̄_a∩𝒴_s} Q_N^{(s)}(y) ≥ ¼                                                       (3.5.29)

    Σ_{y∈𝒴̄_b∩𝒴_s} Q_N^{(s)}(y) ≥ ¼                                                       (3.5.30)

Combining (3.5.27) through (3.5.30) yields the lower-bound relations (3.5.1)
and (3.5.2), and hence the balance of the theorem.
We have already drawn the immediate parallel to binary hypothesis testing.
In applying the theorem in the next section to lower-bound code error probabilities,
we shall demonstrate its further power relative to M hypotheses. Before
proceeding with this more general case, however, we specialize the result to obtain
an upper bound on the tail of the distribution of N independent identically distributed
random variables y_n. Thus let

    η = Σ_{n=1}^N y_n                                                                      (3.5.31)

and

    p_N^{(a)}(y) = ∏_{n=1}^N p(y_n)                                                        (3.5.32)
where p(·) is the common probability distribution (density) of all N variables. Let
us further define the dummy distribution (density)

    p_N^{(b)}(y) = e^{η − Nα} p_N^{(a)}(y)                                                 (3.5.33)

where α is a constant chosen to properly normalize its sum over 𝒴_N to unity. This
will allow us to apply the previous theorem since

    ln [p_N^{(b)}(y)/p_N^{(a)}(y)] = η − Nα                                                (3.5.34)

Consequently (3.5.4) reduces to

    𝒴_a = {y: η < μ'(s) + Nα}                                                              (3.5.35)

and (3.5.5) reduces to

    P_a = Σ_{y∈𝒴̄_a} p_N^{(a)}(y)
        = Pr {y: η ≥ μ'(s) + Nα}
        ≤ e^{μ(s) − sμ'(s)}                                                                (3.5.36)

where, as follows from (3.5.31) and (3.5.33),

    μ(s) = ln Σ_y p_N^{(a)}(y) e^{s(η − Nα)}
         = N ln Σ_y p(y)e^{sy} − Nαs                                                       (3.5.37)

Thus, if we let

    δ = μ'(s) + Nα = γ'(s)

we obtain from (3.5.36) as an upper bound on the tail of the distribution of η

    Pr {η ≥ δ} ≤ e^{γ(s) − sγ'(s)}                                                         (3.5.38)

where γ'(s) = δ and

    γ(s) = N ln Σ_y p(y)e^{sy}        0 < s < 1

This is also a Chernoff bound and, as one would suspect, can be derived more
directly than from the above theorem (see Probs. 2.10 and 3.18). Furthermore, by
arguments very similar to those used in the proof of the theorem, the bound (3.5.38)
can be shown to be asymptotically tight (Gallager [1968]).
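The tail bound (3.5.38) is easily evaluated numerically. The sketch below is our own illustration, not the text's: it assumes Bernoulli(p) summands and illustrative parameter values, solves γ'(s) = δ on a grid of s in (0, 1), and compares the resulting bound with a Monte Carlo estimate of the tail.

    import numpy as np

    rng = np.random.default_rng(0)
    p, N, delta = 0.1, 200, 40.0      # Bernoulli(p) summands; threshold delta on eta = sum of y_n

    # gamma(s) = N ln sum_y p(y) e^{s y}; choose s so that gamma'(s) = delta, cf. (3.5.38)
    s = np.linspace(1e-4, 1 - 1e-4, 20000)
    gamma = N * np.log(1 - p + p * np.exp(s))
    gamma_p = N * p * np.exp(s) / (1 - p + p * np.exp(s))
    i = np.argmin(np.abs(gamma_p - delta))
    bound = np.exp(gamma[i] - s[i] * gamma_p[i])

    est = (rng.binomial(N, p, size=1_000_000) >= delta).mean()   # empirical check
    print("Chernoff bound:", bound, "  empirical Pr{eta >= delta}:", est)

With these values the empirical tail probability falls below the bound, as it must, while sharing the same exponential order in N.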
3.6 SPHERE-PACKING LOWER BOUNDS
Theorem 3.5.1 provides the tools for obtaining lower bounds for any discrete-
input memoryless channel. Its application in the general proof, however, involves
an intellectual tour-de-force, for which the reader is best directed to the original
work (Shannon, Gallager, and Berlekamp [1967]).¹⁴ We shall content ourselves to
state the general result at the end of this section. On the other hand, the flavor,
style, elegance, and even the major details of the general lower-bound proof are
brought out simply and clearly by the derivation of lower bounds for two special,
but important, cases: the unconstrained bandwidth AWGN channel with equal-
energy signals, and the BSC. We proceed to consider them in this order, and then
return to a discussion of the general result.
3.6.1 Unconstrained Bandwidth AWGN Channel with Equal-Energy
Signals
Let each of the M signals have duration T seconds and equal energy ℰ, while the
additive white Gaussian noise has one-sided spectral density N_o W/Hz. By lack of
bandwidth constraints, we mean that no limitations are placed on the signal
dimensionality N or, equivalently, on W ≈ N/2T as discussed in Sec. 2.6. However,
as we found in Sec. 2.1, any set of M finite-energy signals can be represented
using at most M dimensions. Thus unconstrained bandwidth means simply
that we do not restrict N to be any less than M. In Sec. 2.1, we found that the
likelihood function for the mth signal vector sent over this channel is

    p_N(y|x_m) = ∏_{n=1}^N e^{−(y_n − x_{mn})²/N_o}/√(πN_o)                                (2.1.15)

where

    ‖x_m‖² = Σ_{n=1}^N x_{mn}² = ℰ                                                         (3.6.1)

We express this more conveniently for our present purpose as

    p_N(y|x_m) = exp [−ℰ/N_o + (2/N_o) Σ_n y_n x_{mn}] ∏_{n=1}^N e^{−y_n²/N_o}/√(πN_o)     (3.6.2)

Our immediate goal is to lower-bound

    P_{E_m} = ∫···∫_{y∉Λ_m} p_N(y|x_m) dy                                                  (3.6.3)

for the maximum likelihood decision region Λ_m given by

    Λ_m = {y: p_N(y|x_m) > p_N(y|x_{m'}) for all m' ≠ m}                                   (3.6.4)

with boundary points resolved arbitrarily. We have at our disposal Theorem 3.5.1.

¹⁴ Or the more recent and somewhat more direct approach of Blahut [1974] and Omura [1975] (see
Probs. 3.22 and 3.24).
Clearly, we wish to associate p_N(y|x_m) and Λ_m with one inequality in this theorem,
but the choice of the other appears to be an enigma. We proceed, just as in the last
example of the previous section [(3.5.31) through (3.5.38)], by choosing the other
to be a convenient "dummy" probability density; namely

    p_N(y) = ∏_{n=1}^N e^{−y_n²/N_o}/√(πN_o)                                               (3.6.5)

and we let

    p_N^{(a)}(y) = p_N(y)        p_N^{(b)}(y) = p_N(y|x_m)

while

    𝒴_b = Λ_m = 𝒴̄_a                                                                       (3.6.6)

We have then met the conditions and hypotheses of Theorem 3.5.1 and may
therefore apply (3.5.1) through (3.5.3) to conclude that, for each transmitted signal
vector x_m, at least one of the following pair of inequalities must hold.

    ψ_m ≡ ∫···∫_{y∈Λ_m} p_N(y) dy > ¼ exp [μ(s) − sμ'(s) − s√(2μ″(s))]                     (3.6.7)

    P_{E_m} = ∫···∫_{y∉Λ_m} p_N(y|x_m) dy
            > ¼ exp [μ(s) + (1 − s)μ'(s) − (1 − s)√(2μ″(s))]                               (3.6.8)

where

    μ(s) = ln ∫···∫ p_N(y)^{1−s} p_N(y|x_m)^s dy        0 < s < 1                          (3.6.9)

Substitution of (3.6.2) and (3.6.5) in (3.6.9), using (3.6.1), yields

    μ(s) = ln ∫···∫ exp [−sℰ/N_o + (2s/N_o) Σ_n y_n x_{mn}] ∏_n e^{−y_n²/N_o}/√(πN_o) dy
         = −(ℰ/N_o)s(1 − s)                                                                (3.6.10)

Thus μ(s) is invariant to the signal vector's orientation and depends only on its
energy. To determine the significance of the auxiliary variable ψ_m of (3.6.7), we
sum over all messages m. Since the optimum decision regions (3.6.4) are disjoint
and their union covers the entire N-space, we obtain

    Σ_{m=1}^M ψ_m = Σ_{m=1}^M ∫_{y∈Λ_m} p_N(y) dy = ∫_{−∞}^{∞}···∫_{−∞}^{∞} p_N(y) dy = 1  (3.6.11)

Hence, for at least one message m, we must have

    ψ_m ≤ 1/M                                                                              (3.6.12)
for otherwise the summation (3.6.11) would exceed unity. It follows therefore that,
for this message m, ψ_m may be upper-bounded by 1/M. Consequently, letting

    P_{E_max} = max_m P_{E_m}                                                              (3.6.13)

we conclude from (3.6.12), (3.6.13), and (3.6.7) through (3.6.10) that at least one of
the following pair of inequalities must hold.

    1/M ≥ ψ_m > ¼ exp [μ(s) − sμ'(s) − s√(2μ″(s))]                                         (3.6.14)

    P_{E_max} ≥ P_{E_m} > ¼ exp [μ(s) + (1 − s)μ'(s) − (1 − s)√(2μ″(s))]                   (3.6.15)

where

    μ(s) = −(ℰ/N_o)s(1 − s) = −TC_T s(1 − s)                                               (3.6.16)

Consequently

    μ'(s) = −TC_T(1 − 2s)                                                                  (3.6.17)

    μ″(s) = 2TC_T                                                                          (3.6.18)

In the last three equations, we employed the notation of Sec. 2.5, namely

    C_T = (ℰ/N_o)/T                                                                        (2.5.13)

We shall also use the rate parameter defined there, namely

    R_T = (ln M)/T        nats/s                                                           (2.5.14)

Upon use of (3.6.16) through (3.6.18), (2.5.13), and (2.5.14), the lower bounds
(3.6.14) and (3.6.15) become the alternative bounds¹⁵

    R_T < (1/T)[TC_T s² + 2s√(TC_T) + ln 4]
        = C_T s² + o(T)                                                                    (3.6.19)

and

    P_{E_max} > exp {−[TC_T(1 − s)² + 2(1 − s)√(TC_T) + ln 4]}
              = exp {−T[C_T(1 − s)² + o(T)]}                                               (3.6.20)

Since at least one of this last pair of inequalities must hold, we choose s = s_o such
that

    R_T = C_T s_o² + o(T)                                                                  (3.6.21)

where

    0 < R_T < C_T

¹⁵ Here o(T) = 1/√T.
or equivalently

    s_o = √([R_T − o(T)]/C_T)

where

    0 < s_o < 1

Then (3.6.19) is not satisfied; consequently (3.6.20) must be satisfied with s = s_o,
yielding finally

    P_{E_max} > exp {−T[C_T(1 − s_o)² + o(T)]}
              = exp {−T[(√C_T − √(R_T − o(T)))² + o(T)]}
              = exp {−T[(√C_T − √R_T)² + o(T)]}                                            (3.6.22)
While (3.6.22) lower-bounds the probability of error for the worst case, we
actually desire to bound the average error probability

    P_E(M) = (1/M) Σ_{m=1}^M P_{E_m}                                                       (3.6.23)

Now suppose we have the best code of M/2 signals. From (3.6.22), we see that the
maximum error probability among this set of signals is lower-bounded by

    P_{E_max}(M/2) > exp {−T[(√C_T − √R_T')² + o(T)]}                                      (3.6.24)

where

    R_T' = [ln (M/2)]/T = R_T − o(T)

Thus R_T' can be replaced by R_T in (3.6.24). On the other hand, for the best code of
M signals, at least M/2 of its code vectors have

    P_{E_m} ≤ 2P_E(M)                                                                      (3.6.25)

But this subset can be regarded as a code for M/2 signals. Hence, the error
probability for the worst signal in this case must be lower-bounded by (3.6.24),
which pertains to the best code of M/2 signals. As a result, we have

    P_E(M) ≥ ½ P_{E_max}(M/2)
           > exp {−T[E_sp(R_T) + o(T)]}                                                    (3.6.26)

where

    E_sp(R_T) = (√C_T − √R_T)²        0 ≤ R_T < C_T
Amazingly enough, for the range of rates C_T/4 ≤ R_T ≤ C_T, this lower bound
agrees asymptotically with the upper bound for orthogonal signals of (2.5.16). For
lower rates, the upper bound and this lower bound diverge. However, in the next
two sections, we shall determine tighter lower bounds for low rates that agree with
(2.5.16) also for 0 ≤ R_T < C_T/4.
One minor consequence of these results, then, is that they establish that orthogonal
signals are asymptotically optimum (as T and M → ∞) for the unconstrained
AWGN channel. (Regular simplex signal sets are always better, but
asymptotically they are indistinguishable from orthogonal sets.) More importantly,
we have demonstrated in a special case a very powerful technique for
obtaining asymptotically tight lower bounds at all but low rates. This bound is
called the sphere-packing bound for essentially historical reasons, based on the
classical proof for this example and the next example (Fano [1961]). (See
Probs. 3.22 and 3.24 for another proof of the sphere-packing bound.)
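A quick numerical comparison makes the agreement and divergence just described explicit. The following minimal Python sketch (function names and the normalization by C_T are our own choices) tabulates the sphere-packing exponent of (3.6.26) next to the orthogonal-signal upper-bound exponent of (2.5.16); the two coincide for R_T/C_T ≥ 1/4 and separate below that rate.

    import math

    def E_upper(r):
        """Exponent of the orthogonal-signal upper bound (2.5.16), normalized: r = R_T/C_T."""
        return 0.5 - r if r <= 0.25 else (1.0 - math.sqrt(r)) ** 2

    def E_sp(r):
        """Sphere-packing exponent of (3.6.26), normalized by C_T."""
        return (1.0 - math.sqrt(r)) ** 2

    for i in range(11):
        r = i / 10.0
        print(f"R_T/C_T = {r:0.1f}   E_sp/C_T = {E_sp(r):0.3f}   upper/C_T = {E_upper(r):0.3f}")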
3.6.2 Binary-Symmetric Channel
We now turn to the application of Theorem 3.5.1 to the classically most often
considered channel, the BSC, repeating essentially the arguments used for the
AWGN channel but with different justification. In Sec. 2.8, we showed that the
likelihood function for this channel is

    p_N(y|x_m) = p^{d_m}(1 − p)^{N−d_m}                                                    (2.8.3)

where d_m = w(y ⊕ x_m) is the Hamming distance between the channel input and
output binary vectors. For the dummy distribution, we pick in this case the
uniform distribution

    p_N(y) = 2^{−N}        for all y ∈ 𝒴_N                                                 (3.6.27)

Here we identify¹⁶ p_N(y) with p_N^{(a)}(y) and p_N(y|x_m) with p_N^{(b)}(y), and consequently
also identify Λ_m with 𝒴_b and Λ̄_m with 𝒴̄_b. Since these quantities meet the conditions
of Theorem 3.5.1, we can then apply (3.5.1) and (3.5.2) to assert that, for
message m, at least one of the following pair of inequalities must hold:

    ψ_m ≡ Σ_{y∈Λ_m} 2^{−N} > ¼ exp [μ(s) − sμ'(s) − s√(2μ″(s))]                            (3.6.28)

    P_{E_m} = Σ_{y∉Λ_m} p_N(y|x_m)
            > ¼ exp [μ(s) + (1 − s)μ'(s) − (1 − s)√(2μ″(s))]                               (3.6.29)

where

    μ(s) = ln Σ_y [p^{d_m}(1 − p)^{N−d_m}]^s 2^{−N(1−s)}        0 < s < 1                  (3.6.30)

and d_m = w(y ⊕ x_m).

¹⁶ It is actually immaterial whether this or the opposite association is chosen. In the latter case, we
would have to define s = ρ/(1 + ρ) instead of (3.6.39).
But, since x_m is some N-dimensional binary vector and y runs over the set of all
such vectors, it is clear that there exists exactly one vector y (namely, x_m) for
which d_m = 0, N vectors y for which d_m = 1 (at Hamming distance 1 from x_m),
(N choose 2) vectors y for which d_m = 2, and generally (N choose k) vectors y for which
d_m = k (0 ≤ k ≤ N). Thus (3.6.30) may be written and summed as

    μ(s) = ln Σ_{k=0}^N (N choose k)[p^k(1 − p)^{N−k}]^s 2^{−N(1−s)}
         = N{ln [(1 − p)^s + p^s] − (1 − s) ln 2}                                          (3.6.31)

To identify ψ_m, we again recognize that the Λ_m are disjoint decision regions whose
union covers the total space 𝒴_N. Hence, summing over all messages, we have

    Σ_{m=1}^M ψ_m = Σ_{m=1}^M Σ_{y∈Λ_m} 2^{−N} = Σ_{y∈𝒴_N} 2^{−N} = 1

and hence for some m

    ψ_m ≤ 1/M                                                                              (3.6.32)

From (3.6.28) and (3.6.29) we have the two alternative inequalities for this message m

    1/M ≥ ψ_m > ¼ exp [μ(s) − sμ'(s) − s√(2μ″(s))]                                         (3.6.33)

    P_{E_max} ≥ P_{E_m} > ¼ exp [μ(s) + (1 − s)μ'(s) − (1 − s)√(2μ″(s))]                   (3.6.34)
where from (3.6.31) we have

    μ(s) − sμ'(s) = N{−ln 2 + ln [(1 − p)^s + p^s] − sδ(s)}                                (3.6.35)

    μ(s) + (1 − s)μ'(s) = N{ln [(1 − p)^s + p^s] + (1 − s)δ(s)}                            (3.6.36)

where

    δ(s) = [(1 − p)^s ln (1 − p) + p^s ln p]/[(1 − p)^s + p^s]                             (3.6.37)

and¹⁷

    √(2μ″(s)) = N o(N)                                                                     (3.6.38)

Finally, if we make the substitution

    s = 1/(1 + ρ)                                                                          (3.6.39)

we find that (3.6.35) and (3.6.36) become, after some algebraic manipulation,

    μ(s) − sμ'(s)|_{s=1/(1+ρ)} = −NE_o'(ρ)                                                 (3.6.40)

    μ(s) + (1 − s)μ'(s)|_{s=1/(1+ρ)} = −N[E_o(ρ) − ρE_o'(ρ)]                               (3.6.41)
¹⁷ Here o(N) = 1/√N.
where

    E_o(ρ) = ρ ln 2 − (1 + ρ) ln [(1 − p)^{1/(1+ρ)} + p^{1/(1+ρ)}]

Note that this is identical to Eq. (3.4.1), which represents the basic exponent
function for the BSC with input weighting distribution optimized at q = (½, ½).
We may now conclude the argument by defining the rate in nats per binary-channel
symbol as

    R = (ln M)/N        nats/symbol                                                        (3.6.42)

and choosing ρ = ρ* positive. Consequently s* = 1/(1 + ρ*) ∈ [0, 1] is the appropriate
value such that

    R = (ln M)/N
      = (−1/N) ln {e^{μ(s*) − s*μ'(s*) − No(N)}}
      = E_o'(ρ*) + o(N)                                                                    (3.6.43)

where we have used (3.6.40). This then satisfies (3.6.33) with equality and consequently
requires that (3.6.34) must be an inequality. Thus, using (3.6.41), we
have

    P_{E_max} ≥ e^{μ(s*) + (1 − s*)μ'(s*) − No(N)}
              = e^{−N[E_o(ρ*) − ρ*E_o'(ρ*) + o(N)]}        0 < ρ* < ∞                      (3.6.44)
By exactly the same argument which led to (3.6.26), we then have

    P_E(M) ≥ ½ P_{E_max}(M/2)
           ≥ e^{−N[E_sp(R) + o(N)]}                                                        (3.6.45)

where E_sp(R) is defined by the parametric equations

    E_sp(R) = E_o(ρ*) − ρ*E_o'(ρ*)        0 < ρ* < ∞
    R = E_o'(ρ*)                          0 < R < C                                        (3.6.46)

The limits of R are established from the properties of E_o(ρ) (Sec. 3.2); namely, the
facts that E_o(ρ) is a convex ∩ monotonically increasing function and that
lim_{ρ→0} E_o'(ρ) = C and lim_{ρ→∞} E_o'(ρ) = 0. But (3.6.46) is then identical to the upper bound
E(R) of (3.2.8) for the higher-rate region, E_o'(1) ≤ R < C, for the BSC, for which the
latter is optimized for all rates by the choice q = (½, ½). For lower rates, 0 ≤ R <
E_o'(1), the lower-bound exponent E_sp(R) continues to grow faster than linearly
since the function is convex ∪, while the upper-bound exponent E(R) grows only
linearly (see Fig. 3.10). The gap between the upper and lower bounds at low rates
will be reduced in the next two sections.
By analogy to (3.2.6), (3.2.8), and (3.2.12), it follows also that the lower-bound
exponent (3.6.46) can be written as

    E_sp(R) = max_q sup_{ρ≥0} [E_o(ρ, q) − ρR]                                             (3.6.47)
Figure 3.10 Exponents E(R), E_sp(R), E_ex(R), and low-rate improved lower bound.
Thus the construction from the E_o(ρ, q) function (see Figs. 3.1 and 3.2) is the same
as for the upper bound but, rather than terminating at ρ = 1 for R ≤ E_o'(1) (see
Fig. 3.3a), it continues on for all ρ and hence approaches R = 0. This also explains
why the bounds diverge for rates below R = E_o'(1).
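The parametric equations (3.6.46) are convenient for machine evaluation. The following minimal Python sketch is our own illustration: it evaluates E_o(ρ) for a BSC, takes the derivative E_o'(ρ) numerically, and traces the sphere-packing exponent by sweeping ρ.

    import numpy as np

    def bsc_sphere_packing(p, rhos):
        """E_sp(R) for the BSC via (3.6.46):
        E_o(rho) = rho ln 2 - (1+rho) ln[(1-p)^{1/(1+rho)} + p^{1/(1+rho)}],
        R = E_o'(rho),  E_sp = E_o(rho) - rho E_o'(rho)."""
        def E_o(rho):
            return rho * np.log(2) - (1 + rho) * np.log(
                (1 - p) ** (1 / (1 + rho)) + p ** (1 / (1 + rho)))
        h = 1e-5
        Eo = E_o(rhos)
        Eop = (E_o(rhos + h) - E_o(rhos - h)) / (2 * h)   # numerical E_o'(rho)
        return Eop, Eo - rhos * Eop                        # (R, E_sp)

    p = 0.05
    rhos = np.linspace(0.01, 20.0, 12)
    R, Esp = bsc_sphere_packing(p, rhos)
    for r, e in zip(R, Esp):
        print(f"R = {r:0.4f} nats   E_sp(R) = {e:0.4f}")

As ρ → 0 the printed rate approaches C = ln 2 − ℋ(p) with E_sp → 0, while large ρ traces the steep low-rate portion of the curve in Fig. 3.10.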
We have thus obtained almost the same result for the BSC as we had
previously for the AWGN channel; namely, that the lower bound is asymptot-
ically the same as the upper bound (and identical in exponent) for all rates above
some critical medium rate. The results for both of the above special cases can be
obtained in a more intuitive, classical manner using a so-called sphere-packing
argument (see, e.g., Gallager [1968]). We have chosen this less intuitive approach
for two reasons: first, it augments and illustrates the power of Theorem 3.5.1, the
strong version of the Neyman-Pearson lemma; second, it demonstrates the key
steps in the proof for any discrete-input memoryless channel. By these same basic
arguments, augmented by other somewhat more involved and sophisticated
steps,¹⁸ the following general sphere-packing lower bound has been proved.
18 The simplicity of the proofs for the BSC and AWGN channel is due to the considerable input and
output symmetry of these channels. Without this natural symmetry, one must impose the formalism of
“fixed composition codes,” whose justification and eventual removal obscures the basic elegance of
the above technique.
Theorem 3.6.1 (Shannon, Gallager, and Berlekamp [1967]) For any discrete
memoryless channel, the best code of M vectors [rate R = (ln M)/N] has

    P_E(M) ≥ e^{−N[E_sp(R) + o(N)]}

where E_sp(R) is given by the parametric equations (3.6.46) and E_o(ρ) is identical
to the function defined for the upper bound on P_E for the given channel.
3.7 ZERO-RATE LOWER BOUNDS
As we have just noted, the upper and lower bounds, which agree asymptotically
for R ≥ E_o'(1), diverge below this rate and are farthest apart at R = 0. We now
remedy this situation by deriving new zero-rate lower bounds for the AWGN
channel and for all binary-input, output-symmetric channels, which agree asymptotically
with the least upper bound in each case at zero rate. This consequently
guarantees that the expurgated upper bound is asymptotically exact at zero rate.
The low-rate problem is treated in the next section.
3.7.1 Unconstrained Bandwidth AWGN Channel with Equal-Energy
Signals
The principal parameter utilized in low-rate bounds is the minimum distance
between signal vectors. For M real signal vectors of equal energy in an arbitrary
number of dimensions, we upper-bound this minimum distance by first upper-bounding
the average distance between distinct vectors.¹⁹

    (d²)_av = [1/(M(M − 1))] Σ_i Σ_j ‖x_i − x_j‖²
            = [1/(M(M − 1))] Σ_i Σ_j {‖x_i‖² + ‖x_j‖² − 2(x_i, x_j)}
            = 2Mℰ/(M − 1) − [2/(M(M − 1))] ‖Σ_i x_i‖²
            ≤ 2Mℰ/(M − 1)

with equality if and only if the centroid (1/M) Σ_i x_i = 0.

¹⁹ The average involves only those terms for which i ≠ j; hence the denominator is the number of
such terms. However, in the summation we include the i = j terms since they are all zero.
Consequently

    d_min ≡ min_{i≠j} ‖x_i − x_j‖ ≤ [(d²)_av]^{1/2} ≤ √(2ℰM/(M − 1))                       (3.7.1)

Equality holds in (3.7.1) if and only if the signal set is the regular simplex defined
by (2.10.19).
We now apply this result to lower-bounding the error probability for any such
signal set on the AWGN channel. It is reasonable to expect that the greatest
contribution to this error probability will be that resulting from the closest signal
pair. Arbitrarily designating these two signals as x₁ and x₂, we have

    P_{E_max} ≥ P_{E_1} ≥ P_E(1→2)                                                         (3.7.2)

where the notation for the right-hand inequality is that of (2.3.4) and denotes the
pairwise error probability when only the two signals x₁ and x₂ are possible and
the former is transmitted. This inequality follows from the fact that eliminating all
signals but x₁ and x₂ from the signal set allows us to expand both decision regions
and thus obtain a lower error probability. Further, in Sec. 2.3 [Eq. (2.3.10)], we
determined this error probability to be

    P_E(1→2) = Q(‖x₁ − x₂‖/√(2N_o))
             = Q(d_min/√(2N_o))
             ≥ Q(√(ℰM/(N_o(M − 1))))                                                       (3.7.3)

where the last inequality follows from (3.7.1) and the fact that the function Q(x) is
monotonically decreasing in x. Finally, from (3.7.2) and the classical bounds for
the error function given in (2.3.18), we have

    P_{E_max} ≥ e^{−[ℰ/(2N_o) + To(T)]}                                                    (3.7.4)

where o(T) goes to zero as T goes to infinity. Thus, using the same argument
which led to (3.6.26) and the same notation as (2.5.13), we have²⁰

    P_E(M) ≥ ½ P_{E_max}(M/2)
           ≥ e^{−T[C_T/2 + o(T)]}                                                          (3.7.5)
While this lower bound on P_E for the best code is independent of rate, it agrees
asymptotically with (2.5.16), the upper bound (for orthogonal signals), only at
R_T = 0. Also, at high rates, it is clearly looser (smaller) than the sphere-packing
bound (3.6.26). In fact, in the next section, we shall discuss a low-rate bound which
begins with this result and improves on it for all rates 0 < R_T < C_T/4.

²⁰ This form could also have been obtained from (3.5.8).
3.7.2 Binary-Input, Output-Symmetric Channels
The zero-rate lower bound is as easily obtained for this more general class as for
the BSC. The first step again is to upper-bound the minimum distance among
code vectors. Unsurprisingly, the argument is somewhat reminiscent of the above
for the Gaussian channel. We summarize it in the following lemma due to Plotkin
[1951].

Lemma 3.7.1: Plotkin bound For any binary code of M code vectors of
dimension N, the normalized minimum distance between code vectors is
upper-bounded by

    d_min/N ≤ (1/2)[M/(M − 1)]                                                             (3.7.6)
Proof We begin by listing all binary code vectors in an M × N array

    x₁₁   x₁₂   ···   x₁N
     ⋮     ⋮           ⋮
    x_i1  x_i2  ···   x_iN
     ⋮     ⋮           ⋮
    x_j1  x_j2  ···   x_jN
     ⋮     ⋮           ⋮
    x_M1  x_M2  ···   x_MN

Let d(x_i, x_j) = w(x_i ⊕ x_j) be the Hamming distance between x_i and x_j, and
consider the sum over all pairwise Hamming distances (thus counting each
nondiagonal term twice and not bothering to eliminate the case i = j since
it contributes 0 to the sum)

    Σ_i Σ_j d(x_i, x_j) = Σ_{n=1}^N Σ_i Σ_j d(x_in, x_jn)                                  (3.7.7)

where

    d(x_in, x_jn) = w(x_in ⊕ x_jn) = { 0   if x_in = x_jn
                                       1   otherwise

Let v(n) be the number of zeros in the nth column. Clearly for any good
code, v(n) < M; for otherwise that column could be omitted without decreasing
d_min. Then, for each column n and each m for which x_mn = 1, there
are v(n) values of m' for which x_m'n = 0 and hence for which x_m'n ≠ x_mn.
Consequently

    Σ_{m'=1}^M d(x_mn, x_m'n) = v(n)        whenever x_mn = 1
Furthermore, by the same assumption, there are M − v(n) values of m for
which x_mn = 1. Thus

    Σ_{m: x_mn=1} Σ_{m'=1}^M d(x_mn, x_m'n) = [M − v(n)]v(n)                               (3.7.8)

At the same time there are v(n) values of m for which x_mn = 0, and consequently
M − v(n) values of m' for which x_m'n = 1. Thus

    Σ_{m: x_mn=0} Σ_{m'=1}^M d(x_mn, x_m'n) = v(n)[M − v(n)]                               (3.7.9)

Adding (3.7.8) and (3.7.9), we obtain

    Σ_{m=1}^M Σ_{m'=1}^M d(x_mn, x_m'n) = 2v(n)[M − v(n)] ≤ M²/2                           (3.7.10)

since the factor v(M − v) is maximized by v = M/2. Substituting in (3.7.7), we
obtain

    Σ_{m=1}^M Σ_{m'=1}^M d(x_m, x_m') = 2 Σ_{n=1}^N v(n)[M − v(n)] ≤ NM²/2                 (3.7.11)

But, since d(x_m, x_m) = 0 trivially for all diagonal terms, letting

    d_min = min_{m≠m'} d(x_m, x_m')

be the minimum of the nondiagonal terms, we have

    Σ_{m=1}^M Σ_{m'=1}^M d(x_m, x_m') = Σ_{m=1}^M Σ_{m'≠m} d(x_m, x_m')
                                      ≥ M(M − 1)d_min                                      (3.7.12)

Combining the inequalities (3.7.11) and (3.7.12), we obtain

    d_min/N ≤ (1/2)[M/(M − 1)]

which is just (3.7.6) and hence proves the lemma.
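The Plotkin bound is easily checked numerically. The sketch below is our own illustration (a randomly drawn code and our own variable names): it computes d_min/N for a small binary code and compares it with (3.7.6). A random code of course generally falls well below the bound, which applies to the best possible code.

    import numpy as np

    rng = np.random.default_rng(1)
    M, N = 16, 32
    code = rng.integers(0, 2, size=(M, N))          # a random binary code (illustration only)

    # pairwise Hamming distances and the minimum over distinct pairs
    dist = (code[:, None, :] ^ code[None, :, :]).sum(axis=2)
    d_min = dist[~np.eye(M, dtype=bool)].min()

    plotkin = 0.5 * M / (M - 1)                      # upper bound on d_min/N, (3.7.6)
    print("d_min/N =", d_min / N, "  Plotkin bound =", plotkin)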
We now proceed just as for the AWGN channel. Denoting by x₁ and x₂ two
code vectors at minimum distance, we use (3.7.2) again

    P_{E_max} ≥ P_{E_1} ≥ P_E(1→2)                                                         (3.7.2)
with the same justification as before. But in Sec. 3.5 we showed that the two-message
error probability is bounded by (3.5.8)

    P_E(1→2) = P_E(2→1) ≥ e^{μ(s_o) − No(N)}                                               (3.5.8)

where

    μ(s) = ln Σ_y p_N(y|x₁)^{1−s} p_N(y|x₂)^s                                              (3.7.13)

and s_o ∈ [0, 1] is such that

    μ'(s_o) = 0

But since the channel is a memoryless binary-input, output-symmetric channel, we
have²¹

    p_N(y|x₁)/p_N(y|x₂) = ∏_k p(y_k|x_{1k})/p(y_k|x_{2k})

where k runs over the components for which x_{1k} ≠ x_{2k}. Suppose that x_{1k} = 0
in l of these components and x_{1k} = 1 in the remaining d_min − l. Then (3.7.13)
becomes

    μ(s) = l ln Σ_y p₀(y)^{1−s}p₁(y)^s + (d_min − l) ln Σ_y p₁(y)^{1−s}p₀(y)^s

But for this class of channels, the output space is symmetric [i.e., for every y, there
corresponds a −y such that p₁(y) = p₀(−y)]. Thus

    Σ_y p₁(y)^{1−s}p₀(y)^s = Σ_y p₀(−y)^{1−s}p₀(y)^s
                           = Σ_y p₀(y)^{1−s}p₀(−y)^s = Σ_y p₀(y)^{1−s}p₁(y)^s              (3.7.14)

Hence

    μ(s) = d_min ln Σ_y p₀(y)^{1−s}p₀(−y)^s
         = d_min ln Σ_y p₀(y)^s p₀(−y)^{1−s} = μ(1 − s)                                    (3.7.15)

Since μ(s) is convex ∪, it has a unique minimum in (0, 1). Furthermore, since
μ(s) = μ(1 − s), this minimum must occur at s_o = ½, and at this point

    μ'(½) = 0                                                                              (3.7.16)

²¹ To avoid dividing by 0, if p(y_k|x_{mk}) = 0, we replace it by ε. Then we calculate the exponent
(3.7.18), which depends only on Z, and finally let ε → 0. The result is exactly the same as if Z
were calculated directly for the original channel with zero transition probability.
Thus choosing s = s_o = ½, we have from (3.5.8) and (3.7.15)

    P_E(1→2) ≥ exp {N[(d_min/N) ln Σ_y √(p₀(y)p₀(−y)) − o(N)]}                             (3.7.17)

Consequently, as in (3.7.5), we have, using (3.7.2),

    P_E(M) ≥ ½ P_{E_max}(M/2)
           ≥ e^{N[(d_min/N) ln Z − o(N)]}                                                  (3.7.18)

where

    Z = Σ_y √(p₀(y)p₀(−y))

Finally from Lemma 3.7.1, letting²² M > N for any fixed rate as N becomes large,
we have

    d_min/N ≤ ½ + o(N)

Hence

    P_E(M) ≥ e^{N[½ ln Z − o(N)]}
           = e^{−N[E_ex(0) + o(N)]}                                                        (3.7.19)

where we have used (3.3.31), which is the zero-rate upper-bound exponent.
Once again we have obtained a result which is asymptotically tight at zero
rate. This same result has been shown for the entire class of memoryless discrete-input
channels (Shannon, Gallager, and Berlekamp [1967]).

²² This restriction is inconsequential since, provided M grows no faster than linearly with N,
R = (ln M)/N → 0 as N → ∞.
3.8 LOW-RATE LOWER BOUNDS*
We have just closed the gap between the asymptotic lower-bound and the upper-bound
expressions for zero rate, as well as for rates above R = E_o'(1). We now turn
to narrowing the gap for the range 0 < R < E_o'(1). This is partially accomplished
by the following useful theorem.

Theorem 3.8.1 (Shannon, Gallager, and Berlekamp [1967]) Given two rates
R″ < R′ for which error bounds on the best code of dimension N are given by

    P_E(R′) ≥ e^{−N[E_sp(R′) + o(N)]}

    P_E(R″) ≥ e^{−N[E_l(R″) + o(N)]}

where o(N) → 0 as N → ∞, E_sp(R) is the sphere-packing bound exponent, and
E_l(R) is any tighter low-rate exponent. Then the error probability for the best
code of dimension N at the intermediate rate R is lower-bounded by

    P_E(R) ≥ e^{−N[λE_sp(R′) + (1 − λ)E_l(R″) + o(N)]}        where R = λR′ + (1 − λ)R″, 0 < λ < 1        (3.8.1)

* May be omitted without loss of continuity.
(3.8.1)
In other words, if we have a point on the sphere-packing bound exponent and
another on any other asymptotic lower-bound exponent, the straight line connect-
ing these points is itself an asymptotic lower-bound exponent for all intermediate
rates. In connection with the results of the last two sections, this suggests that we
connect the asymptotically tight result at zero rate with the sphere-packing bound
by a straight line which intersects the latter at a rate as close to E’,(1) as possible.
This of course, is achieved by drawing a tangent from the zero-rate exponent value
to the curve of the sphere-packing bound exponent. The result (see Fig. 3.10) is a
bound which is everywhere asymptotically exact for the unconstrained AWGN
channel and for the limit of very noisy channels,** while for all other channels
when E(0) is finite, it is generally reasonably close to the best (expurgated) upper
bounds.
The proof of this theorem is best approached by first proving two key lemmas,
which are interesting in their own right. The first has to do with list decoding, an
important concept with numerous ramifications. Suppose that in decoding a code
of M vectors of dimension N we were content to output a list of the L messages
corresponding to the L highest likelihood functions and declare that an error
occurred only if the transmitted message were not on the list. Then naturally the
probability of error for list-of-L decoding is lower than for ordinary decoding with
a single choice. However, a lower bound, which is identical in form to the sphere-
packing lower bound, holds also in this case.
Lemma 3.8.1 For a code of M vectors of dimension²⁴ N with list-of-L decoding,
the error probability is lower-bounded by

    P_E(N, M, L) ≥ e^{−N[E_sp(R̃) + o(N)]}                                                 (3.8.2)

where

    R̃ = [ln (M/L)]/N                                                                      (3.8.3)

Proof (for binary-input, output-symmetric and AWGN channels) The argument
is almost identical to those in Sec. 3.6 with the exception that now the
decision regions are enlarged to

    Λ_m = {y: p(y|x_m) is among the L largest of p(y|x₁), p(y|x₂), ..., p(y|x_M)}

²³ These channels, for which E_ex(0) = E(0) = E_o(1), are the only channels for which everywhere
asymptotically exact results are known. See also Sec. 3.9.
²⁴ For the unconstrained AWGN channel, the lemma holds with N replaced by T throughout.
That is, Λ_m is the region over which p(y|x_m) is among the top L likelihood
functions. Consequently, each point y ∈ 𝒴_N (the observation space) must lie
in exactly L regions; specifically, if p(y|x_{m₁}) ≥ ··· ≥ p(y|x_{m_L}) are the L greatest
likelihood functions, then y ∈ Λ_{m_k}, k = 1, 2, ..., L. With this redefinition of
Λ_m, the pairs of inequalities (3.6.7), (3.6.8), and (3.6.28), (3.6.29), as well as the
forms of μ(s), (3.6.9) and (3.6.30), appear exactly as before. However, the
values of the ψ_m now differ, and this requires changes in (3.6.12) and (3.6.32).
For now

    Σ_{m=1}^M ψ_m = L                                                                      (3.8.4)

since it follows that, if each y lies in exactly L regions, summing over each of
the M regions {Λ_m} results in counting each point in the space L times. Thus
(3.6.12) and (3.6.32) are replaced by

    ψ_m ≤ L/M                                                                              (3.8.5)

and the rest of the derivation is identically the same. For binary-input, output-symmetric
channels, this means that we replace²⁵ (3.6.42) by

    R̃ = [ln (M/L)]/N

and proceed in exactly the same manner as before, thus obtaining (3.6.45) and
(3.6.46) with R replaced by R̃, which are just (3.8.2) and (3.8.3) of this lemma.

²⁵ For the AWGN channel we replace (2.5.14) by R̃_T = [ln (M/L)]/T.
The other key lemma relates ordinary decoding with list decoding as an
intermediate step.

Lemma 3.8.2 For arbitrary dimensions N₁ and N₂, code size M, and list size
L, on a memoryless channel

    P_E(N₁ + N₂, M) ≥ P_E(N₁, M, L)P_E(N₂, L + 1)                                          (3.8.6)

where P_E(N, M, L) is the list-of-L average error probability for the best code
of length N with M codewords, and P_E(N′, M′) is the ordinary average error
probability for the best code of length N′ with M′ codewords. The two-argument
error probabilities apply to ordinary decoding; the three-argument
probabilities apply to list decoding.

The intuitive basis of this result is that an error will certainly occur for a
transmitted code vector of length N₁ + N₂ if L other code vectors have higher
likelihood functions over the first N₁ symbols and if any one of these has a higher
likelihood function over the last N₂ symbols.

Proof Let each transmitted code vector x_m of dimension N₁ + N₂ be
separated into a prefix

    x_m′ = (x_{m1}, x_{m2}, ..., x_{mN₁})

and a suffix

    x_m″ = (x_{m,(N₁+1)}, x_{m,(N₁+2)}, ..., x_{m,(N₁+N₂)})

Similarly let the received vector y be so separated into an N₁-dimensional
prefix y′ and an N₂-dimensional suffix y″. The overall error probability for
ordinary decoding is, of course, given by (2.3.1) and (2.3.2) as

    P_E = (1/M) Σ_{m=1}^M Σ_{y∉Λ_m} p_N(y|x_m)                                             (3.8.7)

For each prefix y′ let

    Λ_m″(y′) = {y″: y = (y′, y″) ∈ Λ_m}                                                    (3.8.8)

be the set of suffixes for which the overall vector y is in the mth decision
region. Then, since the channel is memoryless, we may rewrite (3.8.7) as

    P_E = (1/M) Σ_{m=1}^M Σ_{y′} p_{N₁}(y′|x_m′) Σ_{y″∉Λ_m″(y′)} p_{N₂}(y″|x_m″)
        = (1/M) Σ_{m=1}^M Σ_{y′} p_{N₁}(y′|x_m′) P_{E_m}(y′)                               (3.8.9)

where P_{E_m}(y′) is the error probability for message m given that the prefix y′
was received.
Let m₁(y′), m₂(y′), ..., m_L(y′) be the L values of m (the L messages) for
which the overall error probabilities, conditioned on the prefix y′ being
received, are the smallest. That is

    P_{E_{m₁}}(y′) ≤ P_{E_{m₂}}(y′) ≤ ··· ≤ P_{E_{m_L}}(y′) ≤ P_{E_{m_k}}(y′)              (3.8.10)

for every k > L. Consequently, for every

    m_k ∉ {m₁(y′), m₂(y′), ..., m_L(y′)}

it follows that

    P_{E_{m_k}}(y′) ≥ P_E(N₂, L + 1)                                                       (3.8.11)

For suppose on the contrary that

    P_{E_{m₁}}(y′) ≤ P_{E_{m₂}}(y′) ≤ ··· ≤ P_{E_{m_L}}(y′) ≤ P_{E_{m_k}}(y′) < P_E(N₂, L + 1)
Now restrict the code to only the L + 1 messages m₁, m₂, ..., m_L, m_k. The
decision regions could then be expanded, leading to

    P̃_{E_{m_j}}(y′) ≤ P_{E_{m_j}}(y′)        j = 1, ..., L, k

where the left side of the inequality refers to error events for the (L + 1)-message
code. Combining these two inequalities, we obtain

    P̃_E(N₂, L + 1) < P_E(N₂, L + 1)

which is obviously in contradiction to the fact that P_E(N₂, L + 1) is a lower
bound for the best code of L + 1 vectors. Thus (3.8.11) must hold and we can
lower-bound the inner summation in (3.8.9) by

    Σ_{y″∉Λ_m″(y′)} p_{N₂}(y″|x_m″) = P_{E_m}(y′)
                                    ≥ { 0                   m ∈ {m₁(y′), m₂(y′), ..., m_L(y′)}
                                        P_E(N₂, L + 1)      m = m_k(y′) where k > L        (3.8.12)

Substituting (3.8.12) in (3.8.9) and changing the order of summation, we
obtain

    P_E ≥ (1/M) Σ_{y′} Σ_{m=m_k(y′): k>L} p_{N₁}(y′|x_m′) P_E(N₂, L + 1)                   (3.8.13)

Finally, consider the prefix symbols x₁′, x₂′, ..., x_M′ as a code of M vectors of
dimension N₁. Then again interchanging the order of summation, we have,
using (3.8.10),

    (1/M) Σ_{y′} Σ_{m=m_k(y′): k>L} p_{N₁}(y′|x_m′) = (1/M) Σ_{m=1}^M Σ_{y′∉Λ_m′} p_{N₁}(y′|x_m′)   (3.8.14)

where Λ_m′ = {y′: m ∈ {m₁(y′), m₂(y′), ..., m_L(y′)}}.
Hence, the right side of (3.8.14) is just the overall error probability for a
list-of-L decoder and consequently is lower-bounded by P_E(N₁, M, L). Substituting
this lower bound for (3.8.14) into (3.8.13), we obtain

    P_E(N₁ + N₂, M) ≥ P_E(N₁, M, L)P_E(N₂, L + 1)                                          (3.8.6)

which thus proves the lemma.
Proof (of Theorem 3.8.1) Substituting (3.8.2) for P_E(N₁, M, L) and an
arbitrary low-rate exponential lower bound for P_E(N₂, L + 1) into (3.8.6),
we have

    P_E(N₁ + N₂, M) ≥ e^{−N₁[E_sp(R′) + o(N₁)]} e^{−N₂[E_l(R″) + o(N₂)]}                   (3.8.15)

where o(N₁) → 0 and o(N₂) → 0 as N₁ → ∞ and N₂ → ∞, respectively. From
(3.8.3), we have

    R′ = [ln (M/L)]/N₁                                                                     (3.8.16)

and we let

    R″ = (ln L)/N₂                                                                         (3.8.17)

Defining

    λ = N₁/(N₁ + N₂)
    1 − λ = N₂/(N₁ + N₂)                                                                   (3.8.18)

we have, using (3.8.16) through (3.8.18),

    R = (ln M)/(N₁ + N₂)
      = λ [ln (M/L)]/N₁ + (1 − λ)(ln L)/N₂
      = λR′ + (1 − λ)R″                                                                    (3.8.19)

Hence, letting N = N₁ + N₂ where both N₁ → ∞ and N₂ → ∞, we obtain
from (3.8.15)

    P_E(R) ≥ e^{−N[λE_sp(R′) + (1 − λ)E_l(R″) + o(N)]}

where

    R = λR′ + (1 − λ)R″        0 < λ < 1

and

    R″ < R < R′

which is just (3.8.1) and hence proves the theorem.
The application of Theorem 3.8.1 involves letting R″ = 0 and using the zero-rate
bound of Sec. 3.7 for E_l(0). Thus in (3.8.1) we let R″ = 0, R′ = R/λ, and
E_l(0) = E_ex(0), so that

    P_E(R) ≥ e^{−N[λE_sp(R/λ) + (1 − λ)E_ex(0) + o(N)]}
           = exp (−N{E_ex(0) − [E_ex(0) − E_sp(R₁)](R/R₁) + o(N)})

where R₁ = R/λ, and hence

    0 < R < R₁

The best choice of R₁, obviously, is the one for which the line from E_ex(0) at R = 0
has maximum slope in magnitude, i.e., the rate at which a tangent line from E_ex(0)
at R = 0 strikes the sphere-packing bound exponent (see Fig. 3.10).
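The tangent-line construction can be carried out numerically. The following minimal Python sketch is our own illustration for a BSC: it traces the sphere-packing curve parametrically and finds the tangent as the steepest (most negative) chord from the point (0, E_ex(0)), which produces the straight-line portion of the low-rate lower bound in Fig. 3.10.

    import numpy as np

    p = 0.05                                   # illustrative BSC crossover probability
    Z = np.sqrt(4 * p * (1 - p))
    Eex0 = -0.5 * np.log(Z)                    # zero-rate lower-bound exponent, (3.3.31)/(3.7.19)

    def E_o(rho):                              # E_o(rho) for the BSC, cf. (3.4.1) with q = (1/2, 1/2)
        return rho * np.log(2) - (1 + rho) * np.log(
            (1 - p) ** (1 / (1 + rho)) + p ** (1 / (1 + rho)))

    rho = np.linspace(0.05, 40.0, 8000)        # sweep rho to trace the sphere-packing curve (3.6.46)
    h = 1e-5
    Eop = (E_o(rho + h) - E_o(rho - h)) / (2 * h)
    R1, Esp = Eop, E_o(rho) - rho * Eop

    chord = (Esp - Eex0) / R1                  # slope of the line from (0, E_ex(0)) to (R1, E_sp(R1))
    k = np.argmin(chord)                       # steepest chord = the tangent line
    print("tangent rate R1 =", R1[k])
    print("low-rate bound: E(R) =", Eex0, "+ (", chord[k], ") * R   for 0 <= R <=", R1[k])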
3.9 CONJECTURES AND CONVERSES*
In the preceding three sections, we have found lower bounds on the best code, for
given N and M, which agree asymptotically at R = 0 and R ≥ E_o'(1) with the
upper bounds derived in the first half of this chapter by ensemble average arguments.
For the low-rate region 0 < R < E_o'(1), asymptotically tight results are not
available, although the exponents of upper and lower bounds are generally close
together and become asymptotically the same in the limit of very noisy channels.
The most likely improvement in the lower bound for this region will come
about as a result of an improvement in the upper bound on minimum distance. In
the case of binary-input, output-symmetric channels, we found in Sec. 3.7 the
lower bound

    P_E ≥ e^{−N[−(d_min/N) ln Z + o(N)]}                                                   (3.7.18)

where

    Z = Σ_y √(p₀(y)p₀(−y)) < 1

Thus, an upper bound on d_min is needed to complete the error bound. In Sec. 3.7,
we derived the Plotkin bound

    d_min/N ≤ ½ + o(N)

where o(N) → 0 as N → ∞, which then led to the lower error bound (3.7.19), which
is tight at zero rate. But it is intuitively clear that the higher the rate, the more
code vectors are placed in the N-dimensional space, and hence the lower the achievable
minimum distance. It is possible to modify the Plotkin bound so as to obtain a form
which decreases linearly with rate (see Prob. 3.33), specifically

    d_min/N ≤ ½[1 − (R/ln 2)] + o(N)                                                       (3.9.1)
but this is by no means tight either. A tighter upper bound on d_min is due to Elias
[1960].²⁶ Also of interest is the tightest known lower bound on d_min; this was derived

* May be omitted without loss of continuity.
²⁶ An even tighter upper bound has been derived by McEliece, Rodemich, Rumsey, and Welch
[1977]. Also, see McEliece and Omura [1977].
by Gilbert [1952] using an essentially constructive argument (see Prob. 3.34). [One
can also derive the Gilbert bound²⁷ by using the expurgated upper bound (3.4.8)
and the lower error bound (3.7.18) for the binary-input, output-symmetric channel.]
The Gilbert lower and Elias upper bounds on normalized minimum distance
for a binary code of N symbols are, respectively,

    δ(R) ≤ d_min/N ≤ 2δ(R)[1 − δ(R)] + o(N)                                                (3.9.2)

where δ(R) is the function defined by

    R = ln 2 − ℋ(δ)        0 ≤ δ ≤ ½                                                       (3.9.3)

The Plotkin and Elias upper bounds and the Gilbert lower bound are all plotted
in Fig. 3.11.
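The curves of Fig. 3.11 follow directly from (3.9.1) through (3.9.3) once δ(R) is obtained by inverting R = ln 2 − ℋ(δ). The following minimal Python sketch is ours, not the text's; it assumes SciPy's brentq root finder for the inversion and tabulates the three bounds at a few rates.

    import numpy as np
    from scipy.optimize import brentq

    ln2 = np.log(2.0)
    H = lambda d: -d * np.log(d) - (1 - d) * np.log(1 - d)     # natural-log entropy, cf. (3.4.2)

    def delta_of_R(R):
        """Invert R = ln 2 - H(delta) for 0 <= delta <= 1/2, cf. (3.9.3)."""
        if R >= ln2:
            return 0.0
        return brentq(lambda d: ln2 - H(d) - R, 1e-12, 0.5)

    for R in (0.05, 0.1, 0.2, 0.4, 0.6):
        d = delta_of_R(R)
        gilbert = d                             # lower bound on d_min/N, (3.9.2)
        elias = 2 * d * (1 - d)                 # upper bound on d_min/N, (3.9.2)
        plotkin = 0.5 * (1 - R / ln2)           # modified Plotkin bound, (3.9.1)
        print(f"R={R:0.2f}  Gilbert={gilbert:0.3f}  Elias={elias:0.3f}  Plotkin={plotkin:0.3f}")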
It is tempting to conjecture, as have Shannon, Gallager, and Berlekamp
[1967], that in fact the Gilbert bound is tight, i.e., that

    d_min/N = δ(R) + o(N)        [conjecture]                                              (3.9.4)

where δ(R) is given by (3.9.3). For then, at least for binary-input, output-symmetric
channels, we would have, using (3.7.18) and (3.9.3),

    P_E ≥ e^{−N[E(R) + o(N)]}

²⁷ Also known as the Varshamov-Gilbert bound, in recognition of the independent work of Varshamov
[1957].
Figure 3.11 Bounds on d_min/N: Plotkin upper bound, Elias upper bound, and Varshamov-Gilbert
lower bound.
where

    E(R) = −δ ln Z        [conjecture]
    R = ln 2 − ℋ(δ)                                                                        (3.9.5)

But interestingly enough, this coincides asymptotically with the expurgated bound
for these channels derived in Sec. 3.4 [see (3.4.8)], so that

    E(R) = E_ex(R)        [conjecture]        0 ≤ R ≤ ln 2 − ℋ(Z/(1 + Z))                  (3.9.6)

Finally, for rates ln 2 − ℋ[Z/(1 + Z)] = E_x'(1) ≤ R ≤ E_o'(1), the upper-bound
exponent is a line of slope −1, tangent to the curved portions for low and
high rates at E_x'(1) and E_o'(1), respectively. Similarly, by Theorem 3.8.1, if the lower
bound (3.9.6) holds, we could then connect it at the highest rate E_x'(1) to the
sphere-packing bound at E_o'(1) by the same straight line. Thus it appears, at least
for the class of binary-input, output-symmetric channels, that the missing link in
showing that the best upper bounds are asymptotically tight everywhere is being
able to show that the conjecture (3.9.4) on the asymptotic tightness of the Gilbert
bound is indeed true. No evidence exists to the contrary, but no real progress
toward a proof is evident. Historical precedents demonstrate that when a particular
result is proven for the BSC, the proof can ultimately be bent to cover essentially
all memoryless channels. Thus, the asymptotic tightness of the Gilbert
bound is one of the most important open questions in information theory.
The other gap in the results of this chapter involves the behavior of any of the
channels considered at rates above capacity. Since both upper and asymptotic
lower-bound exponents approach zero as R → C from below, it would appear that
there is little chance for good performance above C. In fact, for rates above
capacity, two very negative statements can be made. These are known as the
converses to the coding theorem. The first, more general result, due to Fano [1952],
was derived and discussed in Sec. 1.3. It shows that, independent of the encoding
and decoding technique, the average (per symbol) error probability is bounded
away from zero. The second converse, which holds only for block codes, is
the following stronger result.
Theorem 3.9.1: Strong converse to the coding theorem?® (Arimoto [1973])
For an arbitrary discrete-input memoryless channel of capacity C and equal
a priori message probabilities, the error probability of any block code of
dimension N and rate R > C is lower bounded by
fee ge ee (3.9.7)
*® Earlier versions are due to Wolfowitz [1957], who first showed that lim P; = 1 for R > C, and
N7o@
Gallager [1968], who first obtained an exponential form of the bound.
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 187
where
E,(R)= max |min E,(p,q)—pR|>0O forR>C (3.9.8)
—-1<p<0O q
and E,(p, q) is given in (3.1.18).
{Note that E,,(R) is a dual to the form of E(R) given in Sec. 3.1. The main
difference is that the parameter p is restricted to the interval (—1, 0).}
PRooF We bound the average probability of correct decoding of an arbitrary
code by first examining the form
ue Y pvl [¥n)
yeAm
>» max py(y|x,) (3.9.9)
m
This follows from the fact that the optimum decision regions are defined as
|
A. = y max Puy |Xm') = Pr(¥ |Xm) mod 2, MA. 439.10)
Now for any f > 0, we have
B
oe eee (max pvty xa)"
m
M B
< ( > pr(y xa)" (3.9.11)
m=1
Defining a special probability distribution on codewords, namely
1
uM X =X, eas Sige eee yj
n(x) = ¢
| 0 otherwise (3.9.12)
gives us the relation
B
os Px(¥|Xm) < (w > 4w(x)py(y |x)"?
B
= M#( 5, anCodpv(y |x)" (3.9.13)
Using this in (3.9.9), we have the bound
Pcs ae d ( Gn(x)py(y |x)" )
B
+ agins - ( antx)rv(y1x)"” (3.9.14)
188 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
where the maximization is over all distributions qy on #y , not just the special
distribution of (3.9.12). Defining the parameter
p=Bp-12>-1 (3.9.15)
where p = — 1 is taken as the limit as p > —1 from above, we have
1+ p
Pc < M? max (y av(s)pnty|x)"* (3.9.16)
Qn y x
In Lemma 3.2.2, we showed that
2 (x qn(x)Pn(y pore) (3.9.17)
is a convex U function over the space of distributions qy(-) on 2y for p > 0.
For p <0, the same proof of Lemma 3.2.2 shows that (3.9.17) is a convex 0
function over the space of distributions qy(-) on 4. We now restrict p to the
semi-open interval p € (—1, 0]. The Kuhn-Tucker theorem (App. 3B) shows
that there is a unique maximum of (3.9.17) with respect to distributions on Xy
and that it satisfies the necessary and sufficient conditions
> pry |x)" * aly, avy? <> ay, av)’ *? (3.9.18)
where
a(y, qv) = > an(x)py(y |x)/07”
x
for all x € 2y with equality when gy(x) > 0. This maximization is satisfied by
a distribution of the form
y .
an(x) = [] a.) (3.9.19)
n=1
where q(-) satisfies the necessary and sufficient conditions
Y Ply |x) *%a(y, a? <¥ ay, q)**? (3.9.20)
y y
where
a(y, q) = > a(x)p(y|x)/0*”
x
for all x € X with equality when q(x) > 0. Hence from (3.9.16), we have
1+p|N
P. <M? max » ( atsdo(v |x)" |
q y x
= exp | min [E,(p, q) — PR (3.9.21)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 189
Minimizing the bound with respect to p € (—1, 0] yields
Poet (3.9.22)
and hence (3.9.7) when we use P; = 1 — Pe.
For R > C we can show E,,(R) is greater than zero by examining proper-
ties for E,(p, q) for —1 <p <0. Using Lemma 3.2.1, which is proved in
App. 3A, we have for -—1 <p <0
E,(p, q) <9 (3.9.23)
with equality if and only if p = 0. Further, we have
0E,(p, 4)
——— >0 3.9.24
se (3.9.24)
0°E,(p, 4)
SS 3.9.25
dp? ae 0 ( )
and
0E,(p, q
I(q) = a (3.9.26)
p p=0
With these properties, we see that E,,(R) > 0 for R > C by using arguments
dual to those used in Sec. 3.2 to show that E(R) > 0 for R <C.
This concludes our discussion of converses as well as our treatment of error
bounds for general block codes.
3.10 ENSEMBLE BOUNDS FOR LINEAR CODES*
All the bounds derived so far pertain to the best code over the ensemble of all
possible codes of a given size, M, and dimension, N. However, virtually all codes
employed in practical applications are members of the much more restricted
ensemble of linear codes. Clearly the best linear code can be no better than the best
code over the wider set of all possible codes. Hence, all the previous lower bounds
also apply to linear codes. The problem is that the upper bounds, based on
averages over the wider ensemble of all codes, must now be proved over the
narrower ensemble of linear codes. It turns out that this task is not nearly as
formidable as would initially appear. We shall consider here only binary linear
codes and binary-input, output-symmetric channels, but the extension to the codes
over any finite field alphabet is straightforward.
A binary linear code of M = 2* code vectors, as defined in Sec. 2.9, is one in
which the code vectors {v,,} are generated by a linear algebraic operation on the
data vectors {u,,}, the latter being lexicographically associated with all 2* possible
binary vectors from u, = 0 to ux = 1. We generalize the definition of linear codes
* May be omitted without loss of continuity.
190 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
of Sec. 2.9 to one for which the code-vectors contain a constant additive vector vo,
that is
X,=V,=8.6Oy, m=1.2....2° (3.10.1)
where
911 912 Jin
CaS aaa
|Gk1 = GK2 9KN
and
Vo = (V1; --+» Von)
are an arbitrary binary matrix and binary vector. We take here L = N and we take
the signal vectors {x,, = V,,} to be the code vectors. The additive vector vy is an
unnecessary artifice for output-symmetric channels but becomes necessary for the
proof in the absence of symmetry. It is clear that the ensemble of all possible
binary linear codes contains 2“ * !’" members, corresponding to all distinguishable
forms of G and vo.
The average error probability of the mth message over the ensemble of all
possible linear codes is, analogously to (3.1.1)
eae 1
Pe, = SKF DN Pe a Peixs cees Xx) (3.10.2)
(X ps Xz5 +00» Xp) © Fen + 1yN
where ¥(x + 1)n is the space of all possible signal sets generated by (3.10.1). Substi-
tuting the error probability bound for a specific signal set (3.1.4), we have for
i
1
Pe, <2 5+ DW yey Py(y |x.) 7?
y (KX ys Xq0 v0 Xa) ELK +N
M p
<) Y pyly ke SO (3.10.3)
m’'=2
But
X,;=V, =0G+V,
= Vo
and hence can take on any of 2" values. However, once x, is fixed by the choice of
Vo, the remaining signal vectors x5, ..., X, jointly can take on just 2*% possible
values depending only on the KN binary degrees of freedom of the matrix G. Thus
we may express (3.10.3) as
sat eck 1
DB N » Pn (y Bo eis
y xX]
ait os: 5 Paty mw) || p>0 (3.104)
(x5, Peed Xu) € LKN m’=2
x
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 191
where Yxy is the space of all signal sets generated by (3.10.1) when vg is fixed.
Using the definition (3.1.6) we find, analogously to (3.1.7) and (3.1.9) with
0 < p < 1, that the expression in brackets in (3.10.4) becomes
1
DEN Bo 2; Dini ee
X45 «+9 Xyg) © LY KN
: 1 a
= KN 2 ee i
Es (X54, «++ Xyq) © Ln
I 1/(1 +p) j
= fom) seers 2, ¥ pul |X)
i (X55 «02s Xu) € KN m’=2
A 1 M z
= 5KN 2 y ees >: Pyly ae ae ‘| (3.10.5)
é m’=2 (X53, ---»XM)€ Ln
But clearly, for m’ + 1, any given value of x,,, € 2y can be obtained by choosing
some row vector of the G matrix to be an appropriate distinct function of the
remaining row vectors. However, this only leaves 2“~ ’" choices for the remain-
ing vectors. Thus
XY pyly |X)! +)
(X54, -++5 Xyg) © LN
ae yey Pw ly | Xm)
(X45 -0s Xr _ gs Mayr g gs vers Kay) €L(K-1)N Xm
= 2K- DNS pr(y|x)t +) (3.10.6)
Combining (3.10.4) through (3.10.6), ae definition (3.1.6), we obtain
28 1
Py aw Puy [x1)/°* 3 Y ax Pu [Xy) lia |
y &,
im’=2 Xm
1 | 1 p
me ~ ax Ply |X )"O*" |(M — 1) ¥ se ety |x")
1+p
= (My SS ovolyer9|” o<pst
(3.10.7)
which is identical to (3.1.10) with qy(x) = 1/2%. Clearly P,; can be identically
bounded by interchanging indices m and 1 throughout, and the rest of the en-
semble average upper-bound derivation (i., the balance of Secs. 3.1 and 3.2)
follows identically to that for the wider ensemble of all block codes. Thus all the
results of Sec. 3.2 hold for binary linear codes also (with Q = 2) when q(0) =
q(1) = 4, which holds for output-symmetric channels. As we found in Sec. 3.6,
this bound is asymptotically tight for all rates R > E,(1) for this class of channels.
To improve the upper bound at low rates, for the wider ensemble of all block
codes we employed an expurgation argument (Sec. 3.3). However, for binary
192 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
linear codes, the proof of the expurgated bound is easier than the more general
proof of Sec. 3.3. Indeed, expurgation of codewords is not necessary. For M
binary codewords of a linear code used over any binary-input memoryless chan-
nel, we have from (2.3.16) the Bhattacharyya bound for the mth message error
probability
Pr, < » 2, J/ Pn ¥|Xm)Pw(¥ |Xm’) (2.3.16)
m' =m y
For binary code vectors x,, and x,,,, we have
d J Ply |Xm)Pn(Y |Xm’) = » /Po(y)p1(y)
where w(-) denotes the weight of the vector and here w(x,,®xX,,) equals the
number of symbols in which x,,, differs from x,,. Thus
W(Xm BD Xm’)
(3.10.8)
W(Xm OD Xm’)
Pes d » VPs i0)| (3.10.9)
m'#=mt{y
For any linear code of the form
XX =V =UG Ot, (3.10.10)
we have from”? (2.9.10), for any m
{W(Xim © Xm’) for all m’ # m} = {w(u, G), w(uz G), ..., w(Uy G)}
Thus
M W(Up,G)
Pros Di polPi0) (3.10.11)
m’=2Ly
for m= 1, 2, ..., M. Since the bound is the same for all codewords, we have
M w(u,,G) |
Pea ip: Vrs :0)| (3.10.12)
m=2\Ly
Note that this is exactly the form of (2.9.19) but holds for arbitrary binary-input
memoryless channels, without the requirement of output symmetry.
Defining
Z= Lv Po(y)P1(y) (3.10.13)
and the parameter 0 < s < 1 we have the inequality (App. 3A)
M
< ¥ ZG) (3.10.14)
2° We identify v,, there with v,,@ v) here and note that x,, @ Xm = (U, ® u,,)G.
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 193
Ps, and its bound (3.10.14) depend on the particular code generator G as
shown in (3.10.10). We next average P{ and its bound over the ensemble of all
possible binary linear codes which contain 2*” members, corresponding to
all distinguishable forms of G. The average of (3.10.14) over all possible linear
codes is
ea : ee =
Pe Fo 2. 5KN Y Zsw(amG)
G m=2
eal 1
=> > aan 2 (3.10.15)
m=2 G
where we sum over the space of all possible generator matrices G. Noting that
each generator matrix consists of K rows of dimension N, we can express (3.10.15)
in terms of row vectors of the generator matrices as follows.
(eee: M 1 K
: < oy y- ae pd (5. Zswlum18 1D Um282G °° UmKBK) (3.10.16)
ra Bj 8x
where now, for each row, we sum over the space of all possible row vectors. In this
case, all the row vector spaces are the same N-dimensional binary vector space,
#y. For each m¥ 1, we have u,, #0 and hence, in u,,G = u,,;2; ® Un2 22 O
++ ® Unx Bx, at least one row vector adds into the sum to form u,, G. Varying over
all 2** possible matrices G and taking the sum of the rows g,, for which u,,, 4 0
results in 2““~ »)-fold repetition of each of the 2" possible N-dimensional vectors
x = a, G. Thus
(0) ) me ) sw tm 181 D tm282® °° mK BK) — y sq 2S
SiNA 1
725 | k )59 “fi
a +2) (3.10.17)
Combining (3.10.16), (3.10.17), and using (M — 1) < M yields
PE<M(+ +2) (3.10.18)
Hence there exists at least one linear code for which
PE< mn(it2)" 0<s<l (3.10.19)
or with parameter 1 < p = 1/s < c0
pia me! + aa
E
= e~NIEs(p, 1/2)- pR} (3.10.20)
194 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
where
1/p
1 + (5 Vpo0)P.0)]
E,(o, 3) = —p In ” 5 (3.10.21)
Minimizing over p > 1, we have that, for at least one binary linear code
Pixs ae (3.10.22)
where
E.,(R) = Te [E.(p, 4) bac pR]
pa
which corresponds to (3.3.12) and (3.3.13) for this class of channels and is given in
parametric form in (3.4.8) with Z given by (3.10.13).
Thus, for linear block codes over output-symmetric channels, we have obtained
the ensemble average upper bound of Sec. 3.1, and we have demonstrated that the
expurgated error bound of Sec. 3.3 holds regardless of whether or not the channel
is output-symmetric. In the next three chapters we shall consider a special class
of linear codes which can be conveniently decoded and which achieves performance
superior to that of linear block codes.
3.11 BIBLIOGRAPHICAL NOTES AND REFERENCES
The fundamental concepts of this chapter are contained in the original work of
Shannon [1948]. The first published presentation of the results in Secs. 3.1 and 3.2
appeared in Fano [1961], as did those of Sec. 3.10 for the ensemble average. The
present development of Secs. 3.1 through 3.4 is due to Gallager [1965]. The lower-
bound results in Secs. 3.5 through 3.8 follow primarily from Shannon, Gallager,
and Berklekamp [1967]. The strong converse in Sec. 3.9 was first proved by
Wolfowitz [1957]; the present result is due to Arimoto [1973].
APPENDIX 3A USEFUL INEQUALITIES AND THE
PROOFS OF LEMMA 3.2.1 AND THEOREM 3.3.2
3A.1 USEFUL INEQUALITIES (after Gallager [1968],
Jelinek [1968])°*°
Throughout this appendix we use real positive parameters r>0, s > 0, and
0<A<1. Letting J = {1, 2, ..., A} be an index set, we define real nonnegative
numbers indexed by I
a; >0 b; >0 forie I
°° See also Hardy, Littlewood, Polya [1952].
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 195
and probability distributions indexed by J,
P;>0 0,;>0 forie I
where
We proceed to state and prove 11 basic and useful inequalities.
(a) Inr <r—1 with equality iff?! r= 1
Proor f(r) = In r — (r — 1) has derivatives
f(a = 1
r
Since f”(r) <0, we have a unique maximum at r= 1. Hence f(r) = In r —
(r — 1) < f (1) = 0 with equality iff r = 1.
A A
(b) | a? < ¥ Pa; with equality iff
i=1 i
=1
A
a;= ) Pa; for all ie I such that P; > 0
A
i=1 i=1 bate gy
hee
A
a
< ¥ P;| =~ - 1
=
>, Pia;
ti
with equality iff
>! Note that iff denotes “if and only if.”
196 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
A
(c) } Q?P}~* <1 with equality iff P; = Q; for all ie I.
i=1
Proor From (b) we have for each i € I
bia ~* < Ab; + (1 — Aja;
with equality iff a; = b;. Hence, substituting P; and Q; for a; and b; and
summing over i,
A A A
2 Or EAs eC FU Hes Ff,
i=1 i=1 i=1
= |
with equality iff P; = Q; for allie I.
3
A
=1
1 Wey!
pein? (Holder inequality)
@) Sab < (Sa)
with equality iff, for some c, a} ~* = cb? for all i € I.
.
l
ProoF In (c), for each i € J, let
1/Aa Titi A
“ a}! us bi )
Q; = A i. oe
oS a;!* Te
j=1 j=1
The special case 1 = 4 gives
A A LZ A 5 WR.
y aib; < ( ys a] | be oF (Cauchy inequality)
i i i=1
and the integral analog
1/2
| a(x)b(x) dx < ( | a(x) dx )( | b?(x) ax) (Schwarz inequality)
i lire
A A Ad <A
(e) > P,a,b; < | > Pa; ‘) | > P; pa (variant of Holder inequality)
j i=1
with equality iff, for some c
P,a;* = cP,bfO-* for alk ie I
More generally, if the g; are any nonnegative real numbers indexed by /, then
A A va
3 gia;b; < ( > aa}*) | gibi)
i=1 i=1 1
=
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 197
PRooF Let 4; = g/a; and b; = g}~*b; be used in (d).
A
A se | A Pi
(f) | = P.ai] = >) Pia ( 3 Pat!) (Jensen inequality)
i=1 i=1 i=1
with equality iff, for some c, P;a; = cP; for allie I.
PRooFr The upper bound follows from (e) ith b; = 1 for all i € J. The lower
bound follows from (e) with 4; = a? and b; = 1 for allie I.
A eae 1/
(g) | Yai! ‘) = } a) = | - ai) with equality iff only one a; is nonzero.
PRooF Let
P,= = for allie I
» 4;
j=1
Since P; < 1 we have
Peer ar
with equality iff P; = 0 or 1, and thus
A A A
PilA< S P,=1< YP
1 i=1 i
with equality iff only one a; is nonzero.
Thus
M>
.~)
_
ey
M >
ac)
as
I|
1
A
I
eel
rr
-_
———
we
IM»
=)
wee
~——
a
ne
|
and
1/s
(h) plage < (5: Pict) O<r<s
with equality iff for some constant c, P;a; = cP; for all i ¢€ I.
198 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
ProorF Let 5; = 1, a; = a’, and 4 = r/s € (0, 1) in (e).
ar
A _ ast ar A as[ A
(|S ovate] < 15 rat] "| 5 oat
i=1 i=1 i=1
where A = 1 — A, with equality iff, for some c
cQ;ai'*=Q,a}" forallie!
PRooF Let
AS
ype QO; ai mi Fae Ar) b, ae peed Ar) pu yr z
in (e).
(j) Let ay, be a set of nonnegative numbers for 1 <j <J and 1<k<K. Then
J K 1/A]A K J A
(San) [<b (EK)
j=1 \k=1 k 1
and
K Aj1/a
| > ay} | (Minkowski inequality)
1/4 K K (1/a)-1
es] z (2 “n)| 2 es]
=1
(1-A)/a
a]
lA
M>
Pe
eas
Ms
L<)
ee
»
Fee
SS
ee
$ ($0)
or, by dividing both sides by the second term on the right, we have
(de) |< 2(E6)
k=1 \j=
The second inequality follows from this one with the substitution 4, = a%,.
1/A
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 199
(k) Let aj, be a set of nonnegative numbers for 1 <j <J,1<k< K. Then
mY K 1A A K P y
£ ofS)" <§ (ou
k=1 \j=1
and
K J 1/A
2 (» 0,04 =
k=1 \j=1
1/A
J K -
* 0,( > ay} | (Variant of Minkowski inequality)
j=1 c=4
Proor Let aj, = Q?a;, in (j) for the first inequality and 4, = Q}/*a,, in (j) for
the second inequality.
3A.2 PROOF OF LEMMA 3.2.1
1+ p
E,(p,q)= —In > |) acstyx)*?] (3A.1)
y x
From inequality (h) we have, for —1 < p,; < pz
1+p1 i +p
>: q(x)p(y |x)" ‘on > »; q(x)p(y| x) ‘ol (3A.2)
with equality iff, for some c, q(x)p(y|x) = cq(x) for all x € 2. Hence
E,(01, q) a E,(p2 ) q) (3A.3)
with equality iff, for every y € Y, p(y|x)is independent of x € % for those x for which
q(x) > 0. But this is impossible since we assumed I(q) > 0 [see property 1 given in
(1.2.9)]. Thus E,(p, q) is strictly increasing for p > —1 and hence
E
GE,(p, 4) | 4 es (34.4)
Op
Also
E,(p,q)>E,(0,q)=0 pO (3A.5)
with equality iff p = 0. The inequality is reversed for —1 <p <0.
Letting Ae (0, 1), and p, = Ap, + Ap, (where 4 = 1 — A), we have from in-
equality (i) upon letting s=1+p, andr=1+ pp,
Z abstr] ihe » abstylxyher”
A(1 + p1)
A(1 + p2)
| (3A.6)
*: » q(x)p(y |x)"
200 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
Summing (3A.6) over all y € Y, we have
5 [E acomoiere |” sy fF acoro lye
Ai + p1)
A(1 + p2)
x y q(x)p(y |x) ‘ns
(3A.7)
Applying inequality (d) to the right side of (3A.7), we have
core 1+pi)\a
> » q(x)p(y | ee ‘ra 4 >. >; q(x)p(y | x)< ‘n |
y Lx .
y
xX
abstr} ayer)" G8)
2
Ne
Taking —In ( ) of both sides of this last equation yields the desired result
E,(Ap; + Ap2, q) = AE,(91, 4) + AE,(p2, 4) (3A.9)
This proves that E,(p, q) is convex - in p for p > —1 and therefore
0°E,(p, 4)
wil Pra 3A.
dp? oe 0 ( 10)
Equality is achieved in (3A.9) and (3A.10) iff we had equality in the application of
inequalities (i) and (d) that led to (3A.9). Inequality (i) resulted in (3A.6) where for
the given y we have equality iff, for some c,
q(x)p(y |x)" PP = cyq(x)p(y|x)/*" for alll x (3A.11)
Thus equality holds in (3A.7) iff (3A.11) holds for each y. Inequality (d) used to
obtain (3A.8) holds with equality iff, for some c’
1 1+p2
3 absty| xiv” =¢' D absty|xyhor for all y
; : (34.12)
In (3A.12), because of (3A.11), we can factor out p(y|x) = c) > 0 to obtain
1+p
1+p1 1+p2
avs) =’ q(x) for ally § (3A.13)
x: Ay|x)> 0 x: P(y|x)> 0
This implies that for some constant «
>» @x)=a forall y (3A.14)
x: P(y|x)>0
Thus, for all x, y such that q(x)p(y|x) > 0, we have
>, a(x')p(y |x’)
aoe Bs == Of (3A.15)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 201
or as a consequence of definition (3.2.2)
a ia (3A.16)
"Labs )p(y |x’)
3A.3 PROOF OF THEOREM 3.3.2
E,(p, q) =—plny Yah |S Voor ory] ”
eae q(x)q(x’ Soon ror] ‘i (3A.17)
Let 1 < p, < p,. From inequality (h), we have
EE ates ]E Ves ooT) |” |”
=a) 2, aa 5 /P(y|x)p(y|x’) a (3A.18)
with equality iff, for some c
d /P(y|x)p(y|x’) = (3A.19)
for all x, x’ such that q(x)q(x’) > 0. Hence E,(p, q) is an increasing function of p
for p > 1.
Let us examine the condition for equality given by (3A.19). For any x such
that q(x) > 0, we have trivially q(x)q(x) > 0 and
>Y Vply|x)p(y|x) = ¥ p(y| x)
a4
ae (3.20)
Furthermore inequality (c) states that
> Vey|x)p(y|x’) < 1 (3A.21)
y
with equality iff p(y|x) = p(y|x’). Hence equality in (3A.18) is achieved iff
Ply |x) = p(y|x’)
for all y and all x, x’ such that q(x)q(x’) > 0. This is impossible since we assume
I(q) > 0. Thus E,(p, q) is strictly increasing with p for p > 1
0E,(p, q)
ae (3A.22)
202 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
and
E,(p,q)>E,(1,q)>0 forp>1 (3A.23)
Next, from inequality (i), it follows that, for any J € (0, 1) and p, = Ap, +
Ap,, we have
ISS aeemcv | eC TsdO) |" |"
< [EE abyer|S Veoraor] |”
x 2, 2, alx)a) 3 Votvlseore)| "|" (3A.24)
with equality iff, for some c
Y Vely|x)p|x’) = ¢ (3.25)
y
for all nonzero values of
» /P(y|x)p(y|x’) where q(x)q(x’) > 0
From inequality (c), we again have that this sum is 1 iff for all y, p(y|x) = p(y|x’).
The sum is 0 iff, for all y, p(y|x)p(y|x’) = 0. Thus from (3A.24) we have
E.(Ap; + Ap2, @) = AEx(01, 4) + AEx(p2, q) (3A.26)
or equivalently
0°E,(p, q)
i ;
aaa 0 (3A.27)
with equality iff, for every pair of inputs x and x’ for which q(x)q(x’) > 0, either
P(y|x)p(y| x’) = 0 for all y or p(y|x) = p(y|x’) for all y.
APPENDIX 3B KUHN-TUCKER CONDITIONS AND PROOFS
OF THEOREMS 3.2.2 AND 3.2.3
3B.1 KUHN-TUCKER CONDITIONS
Theorem (Gallager [1965]—-special case of Kuhn and Tucker [1951]) Let f(q)
be a continuous convex - function of q = (q1, q2,---, dg) defined over the
region
Q
Po= q:)>) g=t 4q,20;k =1,2,...,0
l=1
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 203
Assume that the partial derivatives Of (q)/0q,, k = 1, 2, ..., Q exist and are
continuous, except possibly when q, = 0 (on the boundary of Pg). Then f (q)
has a maximum for some q° € Pg and necessary and sufficient conditions on
q° = (qf, ---, Yo) to maximize f(q) are that, for some constant /
ad ae k=1,2,...,0 (3B.1)
Od: q=q°
f(a) =i for all k such that g? #0 (3B.2)
Og, q=q°
It is well known that, in real vector spaces without constraints, a convex O
function either has a unique maximum or, if it possesses more than one maximum,
they are all equal, and all points on the line, plane, or hyperplane, connecting
these maxima, are also maxima of the function. Also, necessary and sufficient
conditions for maxima are that all partial derivatives be zero.
Now, if we impose a linear constraint such as
Q
+a=1
k=1
then, by the standard technique of Lagrange multipliers, this can be treated as the
problem of maximizing
fla)+4 4
which yields then (3B.2), and 4 can be obtained from the constraint equation. On
the other hand, if the region Ag is bounded by hyperplanes (q, > 0), we must
recognize that a maximum may occur on the boundary, in which case (3B.2) will
not hold for that dimension, but it would appear that (3B.1) should (see Fig. 3B.1
for the one-dimensional case). We now proceed to prove (3B.1) and (3B.2).
/
(a) Maximum at interior point (3B.2) (b) Maximum on boundary (3B.1)
Figure 3B.1 Examples of maxima over regions bounded by hyperplanes.
204 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
PRrooF Necessity: Assume f (q) has a maximum at q®. Let q’ = (4), q2,---, 9a)
be a distribution vector with q;, > 0 for all k. Since q? maximizes f (q), we have,
for any 6 € (0, 1)
0 > f(6q' + (1 — 6a”) — fq”) (3B.3)
[Note: O0q' + (1 — 4)q° is interior to Ag since q’ is interior to Ag.] Then
consider
as{a' + (1- Oa] _ 2 afta)
dé k=1 Oe
Since q = 0q' + (1 — 0)q° is interior to Pg, all partial derivatives exist by the
hypothesis of the theorem, and consequently the left side also exists.
Obviously
(%-%) (3B4)
q = 6q'+(1—6)q°
q° = 6q' + (1 — 6)q°
0=0
so that, by the mean value theorem, we have from (3B.3)
’ ne 0
o > Hlea' + - oe]
7a for some « € (0, 0) (3B.5)
Using (3B.4) and letting 0 > 0, we obtain
pg
O>lim ) Boe (% — a)
a0 k=1 [x |q=aq’+(1—«)q°
and since the derivatives are continuous by hypothesis
ee y of (a)
ne 3B.6
HX aa, (4k Ik ) ( )
q=q°
Now, for some k = k,, we must have qy, # 0. Let k, be any other integer from
1 to Q. Now since q’ was an arbitrary point in Pg, let us choose it such that
Gea — Teo = Ihr — Thr = € > 0 (3B.7)
q, = a for all k +k, or k, (3B.8)
This is always possible, since q?, + 0 and (3B.7) and (3B.8) guarantee that q’
so chosen is a distribution vector. Substituting (3B.7) and (3B.8) in (3B.6),
we have
0> (7 lee <) (3B.9)
04x, 04x q=q°
Now define
,-F%@) (3B.10)
Oks q=4°
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 205
Since ¢ > 0 in (3B.9), it follows that
S@) cy (3B.11)
04k, q=q°
But, since k, is arbitrary, this establishes the necessity of (3B.1). Furthermore,
if g~, #0, we could take ¢ <0 in (3B.7). This reverses inequality (3B.11),
which thus proves the necessity of (3B.2).
Sufficiency: Now given (3B.1) and (3B.2), we show that
f@)=f a’) forallq’'e A
Given (3B.1) and (3B.2), we have
of
Se (9: — 4k) < A(4k — Qe)
qk |\q=q°
with equality if g? + 0. Summing over k, we have
Q of q Q Q
» oe (a, — 8) < (Yai - af) =0
k=1 Ox i=1 k=1
Now (3B.4) yields
q=q°
df [@q’ + (1 — 0)q°] 2 af (q) 0
= — _— <Q
dé oP ya 04, not Qk )
or equivalently
’ ae te Eee 0
9-0
But, since f (q) is convex -, the left side of (3B.12) can be replaced by
f[Oq' + (1 — 0)q°] —f(q°) _ Of(q’) + (1 — 8) F(q°) —F@
[Oq' + ( a (a), Hla’) + ( foes Sa) _ fq’) —f(@°)
which proves the sufficiency of (3B.1) and (3B.2).
3B.2 APPLICATION TO E,(p,q) AND I(q)
PROOF OF THEOREM 3.2.2 We showed in Lemma 3.2.2 that e~ *°"’® is convex VU.
Thus
f (p, q) = —g~ Folp, a) — -) a(y, q)! +p
y
is convex , and maximizing f(p, q) is equivalent to maximizing
E,(p, q) = —In[—f (9, q)]. Then applying (3B.1) and (3B.2), we have
Of (p,q) _ » Ouly, q)
cae Ses = i +p) daly, q) q(x)
= — (1+ p)¥ ply| x)! * aly, a?
206 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
since
a(y, q) = > a(x)p(y| x) *”
x
Thus
> Ply |x) * aly, q)? =A’ = x forallxe 2% (3B.13)
ad
with equality if q(x) #0. Summing over 2%, after multiplying by q(x) and
interchanging the order of summation, we have for the left side of (3B.13)
Y ate) E pols)" *Panay = ¥ aly, a Eatery sy
>
ice aly, q)’ =f
and for the right side
di aix)A= A
Thus (3B.13) requires
a(y, q)iT? = 2’ (3B.14)
[since (3B.13) holds as an equality if g(x) >0 while, if g(x) =0, it did not
figure in the sum on either side]. Thus combining (3B.13) and (3B.14), we have
Y p(y|x)/4 + aly, gq)’ > aly, q)'*? forallxe % (3.2.23)
MM
with equality for all x such that q(x) > 0.
PROOF OF THEOREM 3.2.3 In Lemma 3.2.3, we proved that I(q) is convex -.
Thus applying (3B.1) and (3B.2), we have
élq)__—a ae
Balx’) ~ aghey | & WP) In PO)
—) d a(x)p(y|x) In 2 q(x”)p(y |x”)
Z aye SPORE oh
ec lar
aa (3B.15)
Summing over x’ € %, after multiplying by q(x’), we have for the left side
: i Og pian
2 a) 2, Py|x)} Fae" POr) 1=I(q)-1
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 207
and for the right side, of course
Lax’ r=s
x’
Thus 2 = I(q) — 1, and consequently (3B.15) becomes
x ply |x) = r xe
d Ply| ) In Fall) <1 C for allxe % (3.2.2)
[since q maximizes I(q)] with equality for all x such that q(x) > 0.
APPENDIX 3C COMPUTATIONAL ALGORITHM FOR
CAPACITY (Arimoto [1972], Blahut [1972])
We have a DMC with input alphabet 2, output alphabet Y, and transition
probabilities p(y|x) for x € 2, ye Y. Let q = {q(x): x € 2%} bea probability dis-
tribution on %. Then channel capacity is
C = max I(q) (3C.1)
where
Ha) = 5 ¥ rcv |s)ats) tn Sa
= 5 ¥ ply|x)aes) mn SY @c2)
where
a(x|y) = . a 3C3)
and
p(y) = ¥ ply|x)a(x) (3C.4)
Let Q = {Q(x|y): x € %, y € Y} be any set of conditional probability distribu-
tions; then
Q(x|y)>0 for all x, y GCS)
and
YOAx|y)=1 forall y (3C.6)
x
208 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
Let
Fla, Q) = E ply|x)aex) n SEL» (C7)
Lemma For any Q we have
I(q) = F(a, Q) (3C.8)
with equality iff Q(x | y) = q(x|y) for all x, y.
ProoF From inequality (1.1.8) we have for any y
2 a(e|y) In 7 <¥ q(x|y) In
(x rare ; O(x|y)
with equality iff Q(x |y) = q(x|y) for all x. Observing that
(3C.9)
P(y| x)a(x) = 4(x|y)p(y)
we see that (3C.8) follows directly from (3C.9).
This lemma then yields
I(q) = ma F(q, Q) (3C.10)
where the maximum is achieved by
p(y |x)a(x)
es 5 nix a
for all x, y (3C.11)
Channel capacity can be expressed in the form
C = max max F(q, Q) (3C.12)
q Q
Suppose now we fix Q and consider the maximization of F(q, Q) with respect to
the input probability distribution q. First we note from (3C.7) that, for fixed Q
O)= LAD x +X Pll xia x) In Q(x|y) (3C-43)
is a convex © function of the set of input distributions q. The Kuhn-Tucker
theorem (App. 3B) states that necessary and sufficient conditions on the q that
maximizes F(q, Q) are
<A Y= for al (3C.14)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 209
with equality when q(x) > 0. A is chosen to satisfy the equality constraint
» a(x) = 1
For q(x) > 0, this becomes
—1— In q(x) + ¥ p(y|x) In Q(x|y) = 4 (3C.15)
F y
ats) = exp |-1 A + p(x) in QE) (C.16)
Choosing / to meet the equality constraint
ed are
we have for q(x) > 0 |
exp Ir p(y|x) In Q(x |
q(x) = ? (3C.17)
Zp X p(y|x’) In Q(x’|y)
Hence we have (3C.11) for the Q that maximizes F(q, Q) for fixed q, and we have
(3C.17) for the q that maximizes F(q, Q) for fixed Q. Simultaneous satisfaction of
(3C.11) and (3C.17) by q and Q achieves capacity.
The computation algorithm consists of alternating the application of (3C.11)
and (3C.17). For any index k = 0, 1, 2, ..., let us define
O(x|y) = P(y|x)q™(x)
= : - for all x, 3C.18
¥ pox ee) : Sec
exp
Y p(y|x) In Q(x y)
ox) =
for all x (3C.19)
» exp p(y|x’) In o6'|y)}
and
C(k + 1) = F(q**», Q®) (3C.20)
The algorithm is as follows:
Step 1. Pick an initial input probability distribution q® and set k = 0. (The
uniform distribution will do.)
210 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
Step 2. Compute Q™ according to (3C.18).
Step 3. Compute q“*") according to (3C.19).
Step 4. Change index k to k + 1 and go to Step 2.
To stop the algorithm, merely set some tolerance level 6 > 0 and stop when index
k first achieves
IC(k + 1) —C(k)| <6 (3C.21)
There remains only the proof that this algorithm converges to the capacity.
Theorem For the above algorithm
lim |C — C(k)| =0 (3C.22)
k->o
PROOF Let
r**D(x) = exp (> p(y|x) In Q™(x|y) REO 12 es
>
so that, from (3C.19)
(k+1)
gr be )= 5 me *)
r
(3C.24)
From (3C.12), we have C > C(k + 1) where now
Ck + 1) = Fa”, Q*)
seis |
= Ld Ply|x)gh* P(x) In om
gare. 2 (k + ey pk + )]
x’
= gtx
— In r** (x) + In (> m+n60) BC 25}
x’
)d, Ply |x) In Q(x] y)
From the definition of r** (x) in (3C.23), we see that the first two terms cancel
giving us
C(k + 1) =in(y pk+ D(x ) (3C.26)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 211
Now suppose q* achieves capacity so that C = I(q*). Consider
pikt Mix)
q( of pikt ris E @)]
1
= —C(k+1)+¥ q*(x) In 7) + ¥° g*(x) In r**1(x)
= Ck +1) +E ¥ pol x)ae(s) In
+ 2 > P(y|x)a*(x) In Q(x | y)
¥ gtx) in LO) =» gtx) n
" ” yes 8. § 3 ead
faa
q(x)
Oo" (x | y)
q®(x)
ae x )g*(x) In Ply |x)p*(y)
Ste
)
—C(k + 1) +>) > p(y|x)q*(x) In
p*(y
= —C(k+1)+C+Y p*(y) In any (3C.27)
y
where
P*(y) = > p(y|x)a*(x)
and
P™(y) = > p(y|x)q™(x)
Again using inequality (1.1.8), we have
p*(y)
p*(y) In > 0 3C.28
2 (y) p(y) (3C.28)
and, from (3C.27)
q’* (x)
C — C(k + 1) )<Qq q*(x) In Fx) (3C.29)
Noting that C > C(k + 1) and summing (3C.29) over k from 0 to N — 1, we
have
ie a(x)
2X le— ek +1) =)" q*(x
; g(x)
(3C.30)
Again from inequality (1.1.8), we have
d q*(x) In q(x) < ¥° q*(x) In g*(x) (3C.31)
x
212 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
and thus
ZIC- CW = Eat)» Fo
The upper bound on (3C.32) is finite and independent of N. Hence
{|C — C(k)|}@1 is a convergent series, which implies
lim |C — C(k)| =0 (3C.33)
k->o
(3C.32)
Similar efficient computational algorithms have been developed for the ex-
purgated exponent E,,(R) (Lesh [1976]) given by (3.3.13) and (3.3.14) and for the
sphere-packing exponent E,,(R) (Arimoto [1976], Lesh [1976]) given by (3.6.47).
Recall that the ensemble average exponent equals the sphere-packing exponents for
higher rates and is easy to derive from E,,(R).
PROBLEMS
3.1 Compute E,(1, q) and E,(1) = max, E,(1, q) for each of the following channels:
a, @
(a) BSC (b) BEC (c) Z channel
Figure P3.1
3.2 (a) Compute max, E,(p, q) = E,(p) and C = max, /(q) for the following channels.
(b) Compute E(R) for each channel.
Hint: Check conditions (3.2.23) for the obvious intuitive choice of q.
a\@ eb,
ad a p by
ay
Figure P3.2
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 213
3.3 (a) Compute E,(p), C, and E(R) for the Q-input, Q-output noiseless channel
p(b,| a4) =1
3 k= 1, y a oa
p(b,|a;)=0 forallk#i | Q
(by Compute E,(p), E,(p), E,(1), and C = lim E/(p) for the Q-input, Q-output channel
p70
p(b,|a;)=p foralli#k
P(b,|4,)=P where p+(Q-1)p=1
Do not compute E(R), but sketch it denoting key numerical parameters.
(c) Find the optimizing q and sketch E(R) for the six-input, four-output channel.
P(b,|a,)=0.01 i#k
i,k #:12,8,4
Figure P3.3
Hint: Show that (c) can be regarded as the superimposition of (b) with Q = 4 and (a) with Q = 2.
3.4 (a) Show that, for a Q-input, J-output memoryless channel, the necessary and sufficient condition
(3.2.23) on the input distribution q which maximizes E,(p, q) can be stated in matrix form as follows
a[PYO*?] >(exp[—E,(p))Ju where a” = [PY +g?
with equality for all k for which gq, #0, where we have used the notation
q~=d@a,) jj 12,2638
Pa=pbjla) k= 1,2,...,0
M=(01,02,...,09) oF = (af, of,..., of)
u=(1,1,...,1) E,(o0)=max E,(p, q) (scalar)
— Q = q
and X’ is the transpose of X.
(b) Under the following conditions
(i) J=Q
(ii) det [PX *?] #0
(iii) q,>90 forallk
show that
(1) Fl yr — u[ PY shal its i= t°
(2) Fle PyT — EF
and consequently
T —E a
q' =e aig is Saad Ad
214 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
(3) Applying the constraint equation (u, q) = 1, show that
E,(p) = p In {uP} *6"}
3.5 Apply the results of Prob. 3.4 to obtain, for the channel of Prob. 3.1c with p = 4
(a) E,(p) = p In [1 + 240+” — 1) 4017)
(b) Find q in terms of p and thus show that the optimizing distribution varies with R. Indicate
specifically q|,-9 and q|,-1.
(c) Find C.
(d) Sketch E(R).
3.6 For the Q-input, (Q + 1)-output “erasure” channel with
p(bj|aq,)=0 jf#kK jf=1,2,...Q k=1,2,...0
(a) Determine the maximizing distribution q for all p € [0, 1].
(b) Determine E,(p), E,(o) and E(R) explicitly and sketch E(R).
3.7 For all three channels of Prob. 3.1, determine
(a) E,(p) = max, E,(p, 4)
(b) E,(p)
(c) E,,(R) and sketch
3.8 (a) For the channel of Prob. 3.3a, determine E,(p) and E,,(R).
(b) Repeat for the channels of Prob. 3.2 and discuss the difference in the results of (i) and (ii).
3.9 For the channel of Fig. 3.7
(a) Find the maximizing q, E,(p), and C. Sketch E(R).
(b) Find E,(p, q) using the same q as in (a). Sketch E,,(R, q) on the same diagram as (a).
3.10 For any distribution q, show that
0E,(p, 4)
dp p=1 (ge |
0E.(p,4)| —__ E,(p, 4)
6p \p=1 = OP
(a) < E,(p, q)
(b) with equality iff C = 0
p=1
3.11 (a) Show that if the Q x Q matrix with elements
1/p
x, x'e £ = {a,°-- ag}
Ox = b J Ply|x)p(y|x’)
is nonnegative definite, then the function
fa) =L YE aa)
1/p
Y Vp(y|x)p(y|x’)
is convex VU in the probability distribution space of q (Jelinek [19685]).
(b) Obtain necessary and sufficient conditions on q to minimize f, and consequently maximize
E,(p, q), for any channel satisfying (a).
3.12 (a) For the binary-input AWGN channel and for the BSC derived from it by hard quantization
of the channel output symbols, verify (3.4.19) and (3.4.20).
(b) Verify Fig. 3.8 (a), (b), (c) in the limit as &,/N, > 0 and &,/N, > ©.
(c) Verify (3.4.21) and obtain curves for the octal output quantized AWGN channel (for the
quantizer of Fig. 2.13, let a= 4./N,/2).
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 215
3.13 Show that the AWGN channel with &,/N, < 1 satisfies the definitions (3.4.23) and (3.4.24) of a
very noisy channel.
3.14 (Parallel Channels) (Gallager [1965]) Let the independent memoryless channels 1 and 2 with
identical input and output alphabets be used in parallel. That is, for each symbol time, we send a
symbol x over channel 1 and simultaneously a symbol z over channel 2.
(a) Treating these parallel channels as a single composite channel, show that for the composite
E,(p, q) ae E,,(p, q;) “2 E,,(p, q>)
where the subscripts 1 and 2 refer to the corresponding exponent function for the individual channels
and q = (q,, q,) is a 2Q-dimensional vector where q, and q, are each Q dimensional.
(b) Show then that
max E,(p, q) = max E,,(p, q,) + max E,,(p, q2)
q q, 4,
3.15 (Sum Channels) (Gallager [1968]) Suppose we have n independent memoryless channels, possibly
with different input and output alphabets. At each symbol time, a symbol is sent over only one of
the channels. We call this a sum channel.
Let £,(9)=max E,(p,q) for the ith channel, i = 1, 2,...,n
E,(e) = max E,(p,q) —‘ for the sum channel
q
B(i) = Pr {using ith channel}
isa if the weighting vector for the ith channel is (q{’, ..., gi) = q', the sum channel weighting is
= (B(1)q™, B(2)a, ..-, B(n)q®).
(a) Show that
n
eFleiie — z. eke. de)/e
i=1
o, i(p)
and pec ee
¥ elBa.toviol
i=1
(b) Show from this that the sum channel capacity C is related to the individual channel capacities
C; by
(c) Apply these results to obtain E,(p) and C for the channel of Fig. 3.7.
3.16 (List Decoding) Suppose the decoder, rather than deciding in favor of a single message m, con-
structs a list m,, m,,...,m, of the L most likely messages. An error is said to occur in list decoding if
the correct message is not in this list.
(a) Show that, for a memoryless channel
» Prly | Xn)
yYeAn
wher. oo te. Py(¥ | Xm,)
a An ae Pyly | Xm)
(b) Using the techniques of Sec. 2.4, show that
> 1 for some set of L messages m,, m,, ..., m, where m, + m for all <
ef... f Py(y |x Pwl¥ |Xm)
|
£ _ xn = |p 1 esp ‘| a
216 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
and
Py, < bs Pwl¥ | Xm) fy(Y)
where
fisb=} EE i eae ee
1#m = myuFxm l=1 Pn (y|x,,)
and thus that, with 2 =
1+ pL
. p
aw Miers eee? [Paty bem)! *02
7
m=m myet=m l=1
(c) Now applying the techniques of Sec. 3.1, obtain an ensemble average bound
L\p
| ee a ae 1 )Pw(y|X;) site? ee ie Ip Qn (x)py(y |x) yi +b)
y Xi
and, since (¥7') < (M — 1)", show that
P.< YY. an(x)pn(y |x)!" vlad =F an(xon(y |x) *°9) TAPS Ss
ae a NIE,(0, q)- pR]
where p = pLsothatO0<p<L
(d) Compare this result, after maximizing with respect to q, with the sphere-packing lower
bound.
3.17 Find p(s) of (3.5.3) for the two N-symbol code vectors (a, a, ..., a) and (b, b, ..., b) for each
of the following channels
oy’.
nor os
p
p
1 ates
be = :
(a) BSC (b) Z channel (c)
Figure P3.17
3.18 (Chernoff Upper Bound on the Tail of a Distribution)
(a) Ifnis an arbitrary random variable with finite moments of all orders and 0 is a constant, show
Pring > Oc. Be" 34) 2320
sah el (s) —s0
where
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 217
(b) Show that minimizing on s results in
Pr {n > 6} < el (s)—sT'(s)
av
where 0 = I''(s) = wi
ds
Hint: Show first that I'(s) — s@ is convex U by comparing it to p(s) of (3.5.3).
(c) Let
N
n= 2 Vs
n=1
where the y,’s are independent identically distributed random variables. Verify (3.5.38)
Pr {n > 0} < eN@)~ 27°
where
i(s) =") = nS ph)
3.19 (a) Apply Prob. 3.18c to the binomial distribution by letting
x 0 with probability 1 — p
n= >) y, where y, =
n=1
1 with probability p
Obtain upper bounds on Pr (my > ND) and Pr (y < ND):
[Pr(1> ND), p<D
HE(D) .ND(y __ (1-D)
tite dhs ioe 8 ~\Pr(n< ND), p>D
Hint: For p > D, replace n by N — n, p by 1 — p, and D by 1 — D.
Also show that, when p = 4 and D < $
e~ NLRC) + o(N)) <Pr {n “ ND} < e~ NR(D)
where
R(D) = In 2 — #(D)
(b) Apply Prob. 3.18 to the Gaussian distribution showing that, if 7 is a zero-mean unit variance
Gaussian random variable
Pr {n > 6} = (0) <e-*”?
3.20 Find the sphere-packing bound for all the channels of Probs. 3.2, 3.3, 3.5, and 3.6.
3.21 Alternative proof of the expurgated bound: For any DMC channel, the expurgated bound given
by Theorem 3.3.1 can be proven using a sequence of ensemble average arguments rather than expurgat-
ing codes from an ensemble as is done in Sec. 3.3.
Begin with a code of block length N and rate R = (In M)/N given by @ = {x,, x2, ..., Xy}. The
Bhattacharyya bound of (2.3.16) gives
Pe (€)< ¥ 1¥ \/pwly|Xm)Pwly|Xm) | = Bu (F)
m' =m y
for m = 1, 2, ..., M.
218 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
(a) Show that for any s € (0, 1]
Ble) < 5 Vow Papa ee) |
for mi = 1, 2, ..., M.
(b) Consider an ensemble of codewords where any codeword x is chosen with probability
Assume that the M — 1 codewords {x,,-}.,+m are fixed and average BS (@) with respect to codeword x
chosen from the above ensemble. Denote this average by B*,(@) and show that
m(6) < M[y(s, )]*
where
y(s, 4) = max )) ato(5 / Ply|x)p(y|x’) )
q(x’)>0
[Here, without loss of essential generality, assume that all codewords in @ satisfy qy(x) > 0.]
(c) Given code @, show that there exists a codeword X,, such that a new code @,, which is the
same as @ with x,, replaced by X,, satisfies
Py (€Cm) < By(Em) < M*[y(s, q)]** — for any s € (0, 1]
(d) Using (c), construct a sequence of codes
€o os {X1, X2, Xu}
6 {X,, X2, X3, Xu}
A ss Re Ba ee
Cet 8 Ra ee)
such that
Py (Gm) < By (Gm) < M*[y(s, q)}*
where
Bn m) a ost n)Pn Tm) | + ae s/ Pry |Xm)Pw(¥ |[Xm’) |
m d y m pays
for m= 1, 2,....,-M:
(e) For code @y = {X,, X,, ..., Xy}, Show that
B,,(€u) ~ B,(Gm) + d J Pn( Y |Xm)Pwy |X’ )
m De
for me 1 2
(f) For code @y,
Pr (@m) Se B,,(@ mu)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 219
and the average error probability is defined by
Show that for any s € (0, 1]
P,(€ 4) < 2M*[y(s, q)}"
(g) By examining necessary conditions for achieving a minimum, show that, over all probability
distributions on ¥
min) ) atxiaee(S J polseo1)} = min }(s, q)
re y q
This then proves that, for any distribution q(-) and any p = 1/s € [1, 00), there exists a code € of
block length N and rate R such that
1/p\Np
x </P(y|x)p(y|x’)
P(€)< 2M) 3 q(x)q(x’)
= Qe MEx(. a)— pR}
where
By maximizing the exponent with respect to the distribution q and the parameter p € [1, 00), obtain
the expurgated bound. Note that this proof does not give the inconvenient term (In 4)/N added to the
rate R as does that in Theorem 3.3.1.
3.22 Discrimination functions and the sphere-packing bound: The sphere-packing lower bound can be
proven for discrete memoryless channels using discrimination functions (see Omura [1975]). Here this
approach is demonstrated for the BSC channel with crossover probability p.
Define a “dummy” BSC with crossover probability p and capacity C = In 2 — #(p). The dis-
crimination between the actual channel and the dummy channel is defined in terms of channel
transition probabilities as
2 Pls)
J(B, p p(y |x
= d Pty |x) ” ply |x)
meple (eo
p —p
For any y > 0 and any x € 2, define the subset G,(x) < Wy as follows:
Py(y|x)
Pyly |x)
(a) For any code @ = {x,, x2, ..., Xy} of block length N and rate R = (In M)/N, show that
G00) = fy: rip) =>
1
oi deed z= d Pwl¥ | Xm)
mat yeAa
‘ Se
> e NU@.P+n—_ > ~ Ply |Xm)
m=1 ye An 7 Gj xm)
e Nis(d, p)+ 7} )
lee ES Palylx)
m=1 yeG,(xm)
where P, is the error probability when code @ is used over the dummy BSC.
220 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
(b) Show that
ot. p
ES Pal Xn) = Br[ + im Pelt Pew
erentre J (P,P) 2
ye G,(xm) N Pyly | —
goes to 0 as N> o.
(c) Using the converse to the coding theorem for p chosen such that
C=In2— #(p)<R
show that there exists an « > 0 such that for N large enough
P; >a eo NtJ(b, p) +7}
Since this is true for any y > 0 and dummy BSC where C < R, define the limiting exponent
E,,(R) = J(p, p)
where ? satisfies
R
In 2 — #(p)
and check that this is the sphere-packing exponent for the BSC.
3.23 Consider sending one of two codewords, x, or x,, over a BSC with crossover probability p. Using
the method of Prob. 3.22 where M = 2 and p = 4, show that for any y > 0 and for all N large enough
P,(1 > 2) > 4 exp [—w(x, ® x2){—In \/4p(1 — p) + y}]
where w(x; ® x2) is the Hamming distance between the two codewords. [For large N we assume
w(x; ® x2) is also large.]
Hint: Consider only those coordinates where x,, # X,,,n = 1, 2,..., N.
3.24 For the unconstrained AWGN channel, prove the sphere-packing lower bound on P, for any
code @ = {x,, X,,..., Rpg} With
ix f* =¢ m= 1,2,...,M
by following the method of Prob. 3.22. Here use the “dummy” AWGN channel that multiplies all
inputs by p. That is, for codeword x € 2, the dummy AWGN channel has transition probability
density .
1 N/2
e7 l¥—exll?/Ne
ity) = (=
o
whereas the actual channel has transition probability density
1 \N/2
Py(y|x) = (=| on lly xll2/Ne
3.25 Consider AWGN channels that employ m frequency-orthogonal signals such as in (2.12.1). These
are commonly called MFSK signals. Show that, for these M-ary input memoryless channels, the
expurgated function E,(p) = max, E,(p, q) has the form
1+(M — gt)
E,(p) = —p In | a
where
Z=Y V/ply|x)pvy|x’)
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 221
for any x # x’. Find Z for the following cases:
(a) Coherent channel with hard M-ary decision outputs.
(b) Noncoherent channel with hard M-ary decision outputs.
(c) Coherent channel with unquantized output vectors.
(d) Noncoherent channel with unquantized output vectors.
Show that the expurgated exponent D = E.,(R) satisfies
R = In M — #(D/d) — (D/d) In (M — 1)
where d = —In Z.
3.26 Suppose we have a DMC with transition probabilities p(y |x). The decoder mistakenly assumes
that the transition probabilities are p(y|x) and bases the maximum likelihood decision rule on these
incorrect transition probabilities. Following Secs. 2.4 and 3.1, derive an ensemble average upper bound
to the probability of a decoding error (Stiglitz [1966]).
It should have the form
P, <exp{—N[—pR + F(p, 4, p,p]} O<p<t
The quantity
R,(p, p) = max F(1, q, p, p)
7“ q
can be used to examine the loss due to not knowing the actual channel parameters.
3.27 Repeat Prob. 3.25 for the noncoherent fading channels with MSFK signals that are discussed
in Sec. 2.12.3.
3.28 Suppose we have a DMC with input alphabet % containing Q symbols. Let
d(x, x’) = In ¥ \/p(y|x)p(y|x’)
satisfy the “ balanced channel” condition
{d(x, x’): x’ € #} = {d,, d,,..., do}
for all x € &. This shows that the set of Bhattacharyya distances from any input x to all other inputs are
the same for all x e 2. For these channels, show that the expurgated exponent D = E.,(R) is given
parametrically by
for s = —1/p € [—1, 0]. Give the specific form of these equations for the multiphase signal set of
Fig. 2.12b used over the AWGN channel.
3.29 Consider a DMC with input alphabet 2, output alphabet ¥, and transition probabilities
{p(y|x):x € &, ye Y}. Given a code @ = {x,,X,, ..., X,g} of block length N and rate R = (In M)/N,
following the proof of Theorem 3.9.1,
(a) Show that the probability of correct decoding is
1
Laer? ene Py(¥ | Xm)
(Assume all messages are equally likely and that the optimum decision rule is used.)
222 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
(b) For any B > 0 show that
1 < 1/p f
Le Srs> D Pry |Xm)
y \m=1
(c) Consider an ensemble of codes where code @ = {x,,X,, ..., Xy} is selected with probability
Q(¢) = [] av(%n)
where
and q(-) is any distribution on 2. For 0 < B < 1, show that P, averaged over this code ensemble
satisfies
Po se NEde.@-0R} = =—where §=op = B—1e€[-1, 0]
1+p
and Ep, 4) = In 5 (¥ aldo xy”)
y ~
(d) Show that, for some distribution q(-) and p € [—1, 0]
E,(p,q) — pR>0 forR>C
Here C is channel capacity for the DMC.
(e) From (d), it follows that, over the ensemble of codes of block length N and rate R with some
distribution q(-), the average probability of correct decoding satisfies
ik NE;-(R)
Pose
where
E,(R)= max {E,(p, 4) — pR} > 0
As psO
for R > C. Compare this result with the strong converse coding theorem in Sec. 3.9. What is the
difference between these results? Explain why the above result is not useful.
3.30 Consider the K-input, K-output DMC where
P(b,|a,) = 1—p k=1,2,...,K
Pf, ,|\a)=p kt, 2,.4.,K—-1
and
P(b, ja,j=p ~ where O< p<}
(a) Find E,() = max E,(p, q) and E,(p).
q
(b) Find channel capacity.
(c) Suppose codeword x, which has N components gives an output y. We now randomly select
x, according to the probability
1
Qn(X2) = KN
What is the probability that x, is chosen such that it is possible for x, also to give output y?
(d) Suppose x,, x3, ..., Xy4 are randomly selected as in (c). Find a union upper bound for the
probability that one or more of the codewords x,, x;,..., X can give output y.
BLOCK CODE ENSEMBLE PERFORMANCE ANALYSIS 223
(e) Determine the ensemble average exponent E(R) for the case where p = 4, and compare this with
the exponent in the bound obtained in (d).
(f) Determine E(R) for p = 0 and explain why it is finite.
3.31 Consider M signals and an additive white Gaussian noise channel with spectral density N,/2. The
signal set is
N
x{t)= } xa¢,(t) O<t<T,i=1,2,...,M
k=1
where {@,(-)}{=1 is a set of orthonormal functions. Suppose we now randomly select codewords by
choosing each x;, independently from the ensemble of random variables with zero mean and variance &.
Using a union of events bound, show that there exists a set of codewords such that the error
probability satisfies
P< M2.
Find C, when
(a) x is a Gaussian random variable.
oo \+./& — with probability 3
ee \-/é with probability >
Hint: Assume x,(t) is sent and bound the error probability by the sum of the two signal error
probabilities between x,(t) and each of the other signals. Then use the bound
[ gs e-©? dg xe" ="
i. ata
and average the bound over the ensemble of codewords.
3.32 Consider the four-input, four-output DMC shown.
(a) What is the channel capacity?
(b) Determine E,(p) = max E,(p, q).
q
(c) Determine and sketch E(R), E.,(R), and E,,(R).
0 0
]
ook
Pr =O Allk
2 2
3 3
Figure P3.32
3.33 (Improved Plotkin Bound) Assume a systematic binary linear code of M = 2* code vectors of
dimensionality N. Let d,;, be the minimum distance between code vectors in this code.
(a) For any 1 <j < K, consider the 2/ code vectors in the code with the first K — j information
bits constrained to be 0. By eliminating these first K — j components in these code vectors, a binary
code of 2/ code vectors of dimensionality N — (K — j) is obtained. Use Lemma 3.7.1 to show that
gt 2
2 27-1
min —
d
(b) Next, show the improved Plotkin bound
1
nin
= <5 (l — R/In 2) + o(N)
224 FUNDAMENTALS OF DIGITAL COMMUNICATION AND BLOCK CODING
where
R= (In M)/N
(c) Show that the improved Plotkin bound is valid for all binary codes of M code vectors of
dimensionality N.
3.34 (Gilbert Bound for Binary Codes)
1. List all 2% possible distinct binary vectors of length N.
2. Choose an arbitrary binary vector from this list and denote it as x,. Delete from the list x, and all
other binary vectors of distance d — 1 or less from x,.
3. From the remaining binary vectors on the list arbitrarily pick x, , then delete from the list x, and all
other binary vectors that are distance d — 1 or less from x,.
4. Repeat Step 3 for vectors x3, X4,..., Xy until the list is empty.
(a) Show that the number of binary vectors selected, M, satisfies
i=0
(b) Using the Chernoff bound (see Prob. 3.19a), show that
a—3
+ ()pi(1 a p)-# < eN dN) pd ce p)y-4
i=0
and, choosing p = 4, show
> eg < eN#dIN)
i=0
(c) From (a) and (6), show that, for any rate R = (In M)/N <1n2, there exists a code of
minimum distance d,,;, where
and 6 satisfies 5 < 4 and
R=In2—- #(6)
This is the Gilbert bound on d,,,,.
(d) Rederive the Gilbert bound for large N by using the expurgated upper bound (3.4.8) and the
lower error bound (3.7.18) for the binary-input, output-symmetric channel. Furthermore, show that
the Gilbert bound holds for linear codes as well by using the expurgated upper error bound derived in
Sec. 3.10.
PART
TWO
CONVOLUTIONAL CODING AND
DIGITAL COMMUNICATION
TST oe
Bay
vee
CHAPTER
FOUR
CONVOLUTIONAL CODES
4.11 INTRODUCTION AND BASIC STRUCTURE
In the two preceding chapters we have treated digital communication over a
variety of memoryless channels and the performance enhancement achievable by
block coding. Beginning with the most general block codes, we proceeded to
impose the linearity condition which endowed the codes with additional structure,
thus simplifying both the encoding-decoding procedure and the performance
analysis for many channels of interest. Of particular significance is the fact that,
for a given block length and code rate, the best linear block code performs about
as well as the best block code with the same parameters. This was demonstrated
for a few isolated codes and channels in Chap. 2, and more generally by ensemble
arguments in Chap. 3.
In the narrowest sense, convolutional codes can be viewed as a special class of
linear block codes, but, by taking a more enlightened viewpoint, we shall find that
the additional convolutionai structure endows a linear code with superior proper-
ties which both facilitate decoding and improve performance. We begin with the
narrow viewpoint, mainly to establish the connection with previous material, and
then gradually widen our horizon. In this chapter and the next, we shall exploit
the additional structure to derive a maximum likelihood decoder of reduced
complexity and improved performance, first for specific codes and channels and
then more generally on an ensemble basis, following essentially the outlines used
for block codes in Chaps. 2 and 3. Finally, in Chap. 6, we treat sequential decod-
ing algorithms which reduce decoder complexity at the cost of increased memory
and computational speed requirements.
227
228 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Consider first the linear block code specified by the binary generator matrix
Ho ce a (K) :
Se ee gf)
(3) eS wy (K +1)
80° 81-82 £k-1
G= gi) 4: aig (4.1.1)
B= K-11) B
| ef ar or
where g*) = (g'} g% --- g%) is an n-dimensional binary vector and blank areas in
the matrix G indicate zero values. G describes an (nB, B — K + 1) linear block
code which could be implemented, as shown in Fig. 2.16, by a (B — K + 1)-stage
fixed register and nB modulo-2 adders. A simpler mechanization, particularly
since generally B > K, utilizes a K-stage shift register with n modulo-2 adders and
time-varying tap coefficients g‘?, as shown in Fig. 4.1. The shift register can be
viewed either as a register whose contents are shifted one stage to the right as each
new bit is shifted in from the left, with the rightmost stage contents being lost, or
as a digital delay line in which each delay element stores one bit between arrival
times of the input bits. Both representations are shown in Fig. 4.1, with the former
shown dotted, the latter being the preferred form.
We note also that in the shift register or delay line implementation the
I me
: | a |
S: Es ae
(7)
|
OF) -
L® -
Bs Nha
Figure 4.1 A time-varying convolutional encoder: rate r = 1/n bits/channel symbol.
CONVOLUTIONAL CODES 229
(B — K + 1)st (last) bit must be followed by K — 1 zeros to clear the register’ and
to produce the last K — 1 output branches, which are sometimes called the tail of
the code.
Thus it appears that the encoder complexity is independent of block length
nB, and depends only on the register length K, and the code rate,* which, when
measured in bits per output symbol, approaches 1/n as B— oo. K is called the
constraint length of the convolutional code. On the basis of the shift register im-
plementation it should also be clear that the greater the ratio B/K, the less the tail
“overhead” in the sense that, since the last K — 1 input bits are zeros, the tail
reduces the code rate in proportion to (K — 1)/B.
The term “convolutional” applies to this class of codes because the output
symbol sequence vy can be expressed as the convolution of input (bit) sequence u
with the generator sequences. For, since the code is linear, we have
¥="G
and, as a consequence of the form of G of Eq. (4.1.1)
os y ete Digiigee 1 Sale Naga (4.1.2)
k=max(1, i—K + 1)
where V; = (v;1, V;2, ---, Vin) is the n-dimensional coder output just after the ith bit
has entered the encoder.
While, for theoretical reasons, in the next chapter we shall be interested in
the ensemble of time-varying convolutional codes just described, virtually all
convolutional codes of practical interest are time-invariant (fixed). For such
codes, the tap coefficients are fixed for all time, and consequently we may delete
all superscripts in the matrix (4.1.1) with the result that each row is identical to
the preceding row shifted n terms to the right. An example of a fixed convolutional
code with constraint length K = 3 and code rate $ is shown in Fig. 4.2a. Here
the generator matrix of (4.1.1) has the form
2 ies Et Ber ee
' This is required to terminate the code. Alternatively, the convolutional code may be regarded as a
long block code (with block length nB arbitrarily large) and this termination with (K — 1) zeros clears
the encoder register for the next block.
* The code rate is actually [1 — (K — 1)/B]/n because of the (K — 1) zeros in the tail. However we
generally disregard the rate loss in the tail since it is almost always insignificant, and henceforth the
rate shall refer to the asymptotic rate; that is to the ratio of input bits to output symbols, exclusive
of the tail (1/n in this case).
230 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
The rate of the class of convolutional codes defined in this manner is 1/n bits
per output symbol.* To generalize to any other rational rate less than unity,
we must generalize the matrix G of (4.1.1) or its implementation in Fig. 4.1. We
may most easily describe higher code rate convolutional codes by specifying that
b > 1 bits be shifted together in parallel into the encoder every b bit times, and
simultaneously that the bits already within the encoder are shifted to the right
in blocks of b. Here K is the number of b-tuples in the register so that a total
of bK bits influence any given output and consequently bK is now the constraint
length. In terms of the generator matrix (4.1.1), we may describe a convolutional
code of rate b/n by replacing the n-dimensional vector components g_k^{(i)} with
b x n matrices. The implementation of Fig. 4.1 can best be generalized by
providing b parallel delay lines, every stage of each delay line being connected
through a tap multiplier to each modulo-2 adder. Examples of fixed convolutional
codes of rates 2/3 and 3/4 with K = 2 are shown in Fig. 4.2b and c, respectively.
The generalization to time-varying convolutional codes of any rate b/n and any
constraint length K is immediate.
A fixed convolutional coder may be regarded as a linear time-invariant finite-
state machine whose structure can be exhibited with the aid of any one of several
diagrams. We shall demonstrate the use and insight provided by such diagrams
with the aid of the simple example of Fig. 4.2a. It is both traditional in this field
and instructive to begin with the tree diagram of Fig. 4.3. On it we may display
both input and output sequences of the encoder. Inputs are indicated by the path
followed in the diagram, while outputs are indicated by symbols along the tree’s
branches. An input zero specifies the upper branch of a bifurcation while a one
specifies the lower one. Thus, for the encoder of Fig. 4.2a, the input sequence 0110
is indicated by moving up at the first branching level, down at the second and
third, and again up at the fourth to produce the outputs indicated along the
branches traversed: 00, 11, 01, 01. Thus, on the diagram of Fig. 4.3, we may
indicate all output sequences corresponding to all 32 possible sequences for the
first five input bits.
From the diagram, it also becomes clear that after the first three branches the
structure becomes repetitive. In fact, we readily recognize that beyond the third
branch the code symbols on branches emanating from the two nodes labeled a are
identical, and similarly for all the identically labeled pairs of nodes. The reason for
this is obvious from examination of the encoder. When the third input bit enters
the encoder, the first input bit comes out of the rightmost delay element, and
thereafter no longer influences the output code symbols. Consequently, the data
sequences 100xy... and 000xy... generate the same code symbols after the third
branch and thus both nodes labeled a in the tree diagram can be joined together.
This leads to redrawing the tree diagram as shown in Fig. 4.4. This new figure
has been called a trellis diagram, since a trellis is a tree-like structure with remerg-
³ We use small r to denote code rate in bits per output symbol; that is, when we use the logarithm
to the base 2 to define rate.
Figure 4.2 Fixed convolutional encoder examples: (a) K = 3, r = 1/2; (b) K = 2, r = 2/3; (c) K = 2, r = 3/4.
ing branches. We adopt the convention here that code branches produced by a
“0” input bit are shown as solid lines and code branches produced by a “1” input
bit are shown dashed. We note also that, since after B — K + 1 input bits the code
block (4.1.1) is terminated by inserting K — 1 zeros into the encoder, the trellis
terminates at an a node as shown in Fig. 4.4. The last two branches are then the
tail of the code in this case.
The completely repetitive structure of the trellis diagram suggests a further
reduction of the representation of the code to the state diagram of Fig. 4.5. The
states of the state diagram are labeled according to the nodes of the trellis
diagram. However, since the states correspond merely to the last two input bits to
the coder, we may use these bits to denote the nodes or states of this diagram.
Figure 4.3 Tree-code representation for encoder of Fig. 4.2a.
Figure 4.4 Trellis-code representation for encoder of Fig. 4.2a.
Throughout the text, we shall adopt this convention of denoting the state of a rate
1/n convolutional encoder by the latest K — 1 binary symbols in the register with
the most recent bit being the last bit in the state.
We observe finally that the state diagram can be drawn directly by observing
the finite-state machine properties of the encoder and particularly by observing
the fact that a four-state directed graph can be used to represent uniquely the
input-output relation of the K = 3 stage machine. For the nodes represent the
previous two bits, while the present bit is indicated by the transition branch; for
example, if the encoder contains 011, this is represented in the diagram by the
transition from state b = 01 to state d = 11 and the corresponding branch indicates the code symbol outputs 01.
Figure 4.5 State diagram for encoder of Fig. 4.2a.
Figure 4.6 State diagram for encoder of Fig. 4.2b.
To generalize to rate b/n convolutional codes, we note simply that the tree
diagram will now have 2^b branches emanating from each branching node.
However, the effect of the constraint length K is the same as before, and hence,
after the first K branches, the paths will begin to remerge in groups of 2^b; more
precisely, all paths with b(K − 1) identical data bits will merge together, produc-
ing a trellis of 2^{b(K−1)} states with all branchings and mergings occurring in groups
of 2^b branches. Here K represents the number of b-tuples stored in the register.
Consequently, the state diagram will also have 2^{b(K−1)} states, with each state
having 2^b output branches emanating from it and 2^b input branches arriving into
it. An example of a state diagram for the rate 2/3 code of Fig. 4.2b is shown in
Fig. 4.6. Other examples are treated in the problems.
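The finite-state-machine view underlying the tree, trellis, and state diagrams can be tabulated directly from the tap vectors. The sketch below (again Python, illustrative only; the taps are the same assumed K = 3, rate 1/2 pair used earlier) lists, for each of the 2^{K−1} states and each input bit, the successor state and the n output symbols on the connecting branch.

    def state_tables(generators):
        """Next-state and branch-output tables for a fixed rate 1/n code.
        States are integers encoding the last K-1 input bits
        (most recent bit in the low-order position)."""
        K = len(generators[0])
        n_states = 1 << (K - 1)
        next_state, output = {}, {}
        for s in range(n_states):
            stored = [(s >> i) & 1 for i in range(K - 1)]       # most recent first
            for u in (0, 1):
                register = [u] + stored
                branch = tuple(sum(r * t for r, t in zip(register, g)) % 2
                               for g in generators)
                next_state[(s, u)] = ((s << 1) | u) & (n_states - 1)
                output[(s, u)] = branch
        return next_state, output

    ns, out = state_tables([[1, 1, 1], [1, 0, 1]])   # assumed taps, as before
    print(out[(0, 1)], ns[(0, 1)])                   # branch 11 from state 00 on input 1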
Up to this point in our treatment of nonblock codes, we have only considered
linear codes. Just as linear block codes are a subclass of block codes, convolu-
tional codes are a subclass of a broader class of codes which we call trellis codes.
Rate b/n trellis encoders also emit n channel symbols each time b source bits enter
the register. However, general trellis encoders can produce symbols from any
channel input alphabet, and these symbols may be an arbitrary (nonlinear) func-
tion of the bK source bits in the encoder register. Since the K-stage register is the
same for the general class of trellis codes as for convolutional codes, the tree,
trellis, and state diagrams are the same and the trellis encoder output symbols can
be associated with branches just as was done previously for the subclass of convo-
lutional codes. It is clear that general trellis codes have the same relationship to
general block codes as convolutional codes have to linear block codes.
We have seen here that the tree, trellis, and state diagram descriptions of
convolutional and trellis codes are quite different from our earlier description of
block codes. How then do we compare block codes with convolutional codes?
Returning to our earlier discussion on the generation of convolutional codes, we
see that the parameters bK, the constraint or “memory” length of the encoder,
and r = b/n, the rate in bits per channel symbol, are common to both block and
convolutional encoders. For both cases, the same values of these parameters result
in roughly the same encoder complexity. We shall soon see that the complexity of
a maximum likelihood decoder for the same bK and r is also roughly the same for
block codes and convolutional or trellis codes. Hence, for the purpose of compar-
ing block codes and convolutional codes, we use the parameters bK and r. We
shall see that, for the same parameters bK and r, convolutional codes can achieve
much smaller error probabilities than block codes.
We began the discussion in this section by viewing convolutional codes as a
special case of block codes. By choosing K = 1 and n = N in the above, we get a
rate b/N block code, and thus paradoxically linear block codes can themselves be
considered special cases of convolutional codes, and the broader class of block
codes can be considered special cases of trellis codes. It is a matter of taste as to
which description is considered more general.
4.2 MAXIMUM LIKELIHOOD DECODER FOR
CONVOLUTIONAL CODES—THE VITERBI ALGORITHM
As we have seen, convolutional codes can be regarded as a special class of block
codes; hence the maximum likelihood decoder for a convolutional code, as
specified by (4.1.1), can be implemented just as described in Chap. 2 for a block
code of B — K + 1 bits, and will achieve a minimum block error probability for
equiprobable data sequences. The difficulty, of course, is that efficient convolu-
tional codes have a very large block length relative to the constraint length K, as
discussed in the preceding section; in fact, rarely is B less than several hundred,
and often the encoded data consists of the entire message (plus final tail). Since the
number of code vectors or code paths through the tree or trellis is 2^{B−K+1}, a
straightforward block maximum likelihood decoder utilizing one decoder element
per code vector would appear to be absurdly complex. On the other hand, just as
we found that the encoder can be implemented with a complexity which depends
on K — 1 rather than on B, we shall demonstrate that the decoder complexity
need only grow exponentially with K — 1 rather than B. For the sake of simple
exposition, we begin this discussion by treating the K = 3, rate = 1/2 code of
Fig. 4.2a, and we assume transmission over a binary symmetric channel (BSC).
Once the basic concepts are established by this example, the maximum likelihood
decoder of minimum complexity can be easily found for any convolutional code
and any memoryless channel.
We recall from Sec. 2.8 that, for a BSC which transforms a channel code
symbol “0” to “1” or “1” to “0” with probability p, the maximum likelihood
decoder reduces to a minimum distance decoder which computes the Hamming
distance from the error-corrupted received vector y_1, y_2, ..., y_j, ... to each pos-
sible transmitted code vector x_1, x_2, ..., x_j, ... and decides in favor of the closest
code vector (or its corresponding data vector).
code vector (or its corresponding data vector).
Referring first to the tree diagram code representation of Fig. 4.3, we see that
this implies that we should choose that path in the tree whose code sequence
differs in the fewest number of symbols from the received sequence. However,
recognizing that the transmitted code branches remerge continually, we may
equally limit our choice to the possible paths in the trellis diagram of Fig. 4.4.
Examination of this diagram indicates that it is unnecessary to consider the entire
received sequence of length nB (n = 2 in this case) in deciding upon earlier seg-
ments of the most likely (minimum distance) transmitted sequence, since we can
eliminate segments of nonminimum distance paths when paths merge. In particu-
lar, immediately after the third branch we may determine which of the two paths
leading to node or state a is more likely to have been sent. For example if 010001 is
received, then since this sequence is at distance 2 from 000000 while it is at
distance 3 from 111011, we may exclude the lower path into node a. For, no
matter what the subsequent received symbols will be, they will affect the distances
only over subsequent branches after these two paths have remerged, and con-
sequently in exactly the same way. The same can be said for pairs of paths merging
at the other three nodes, b, c, and d, after the third branch. Of the two paths
merging at a given node, we shall refer to the minimum distance one as the
survivor. Thus it is necessary to remember only which was the survivor (or
minimum-distance path from the received sequence) at each node, as well as the
value of that minimum distance. This is necessary because, at the next node level,
we must compare the two branches merging at each node that were survivors at
the previous level for possibly different nodes; thus the comparison at node a after
the fourth branch is among the survivors of comparisons at nodes a and c after the
third branch. For example, if the received sequence over the first four branches is
01000111, the survivor at the third node level for node a is 000000 with distance 2
and at node c it is 110101, also with distance 2. In going from the third node
level to the fourth, the received sequence agrees precisely with the survivor from c
but has distance 2 from the survivor from a. Hence the survivor at node a of the
fourth level is the data sequence 1100, which produced the code sequence 11010111
which is at (minimum) distance 2 from the received sequence.
In this way, we may proceed through the trellis and, at each step for each
state, preserve only one surviving path and its distance from the received
sequence; this distance is the metric⁴ for this channel. The only difficulty which
may arise is the possibility that, in a given comparison between merging paths, the
distances or metrics are identical. Then we may simply flip a coin to choose one, as
was done for block codewords at equal distances from the received sequence. For
even if we preserved both of the equally valid contenders, further received symbols
would affect both metrics in exactly the same way and thus not further influence
⁴ As defined in Sec. 2.2, the metric is the logarithm of the likelihood function. For the BSC, it is
convenient to use the negative of this which is proportional to Hamming distance [see also (4.2.2)].
Figure 4.7 Example of decoding for encoder of Fig. 4.2a on the BSC (received vector 01 00 01 00 00); decoder state metrics are encircled.
our choice. The decoding algorithm just described was first proposed by Viterbi
[1967a]; it can perhaps be better appreciated with the aid of Fig. 4.7 which
shows the trellis for the code just considered with the accumulated distance and
corresponding survivors for the particular received vector 0100010000 ....
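The algorithm just described reduces to a few lines once the state tables are available. The sketch below (Python, illustrative; it assumes the hypothetical state_tables() helper given earlier and hard decisions on a BSC) keeps one survivor and one Hamming-distance metric per state, breaking ties arbitrarily.

    def viterbi_bsc(received, generators):
        """Maximum likelihood (minimum Hamming distance) decoding on a BSC.
        received: hard-decision channel symbols; returns (data bits, metric)."""
        n = len(generators)
        K = len(generators[0])
        n_states = 1 << (K - 1)
        ns, out = state_tables(generators)
        INF = float("inf")
        metric = [0.0] + [INF] * (n_states - 1)      # start from the all-zeros state
        paths = [[] for _ in range(n_states)]
        for i in range(0, len(received), n):
            branch = received[i:i + n]
            new_metric = [INF] * n_states
            new_paths = [[] for _ in range(n_states)]
            for s in range(n_states):
                if metric[s] == INF:
                    continue
                for u in (0, 1):
                    d = sum(a != b for a, b in zip(out[(s, u)], branch))
                    t = ns[(s, u)]
                    if metric[s] + d < new_metric[t]:    # keep the surviving path
                        new_metric[t] = metric[s] + d
                        new_paths[t] = paths[s] + [u]
            metric, paths = new_metric, new_paths
        best = min(range(n_states), key=lambda s: metric[s])
        return paths[best], metric[best]

    # The received vector of the example above (01 00 01 00 00), assumed taps as before:
    print(viterbi_bsc([0, 1, 0, 0, 0, 1, 0, 0, 0, 0], [[1, 1, 1], [1, 0, 1]]))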
It is also evident that, in the final decisions for the complete trellis of Fig. 4.4,
the four possible trellis states are reduced to two and then to one in the tail of the
code. While at first glance this appears appropriate, practically it is unacceptable
because it requires a decoding delay of B as well as the storage, for each state,
of path memories (i.e., the sequence of input bits leading to the most likely set
of four states at each node level) of length B. We shall demonstrate in Secs. 4.7
and 5.6 that performance is hardly degraded by proper truncation of delay and
memory at a few constraint lengths. For the moment, however, we shall ignore this
problem and be amply content with the realization that we have reduced the
number of decoding elements per data bit (metric calculations) to an exponential
growth in K − 1 (2^{K−1} = 4 in this case) rather than in B.
Another description of the algorithm can be obtained from the state-diagram
representation of Fig. 4.5. Suppose we sought that path around the directed state
diagram, arriving at node a after the kth transition, whose code symbols are at a
minimum distance from the received sequence. But clearly this minimum distance
path to node a at time k can be only one of two candidates: the minimum distance
path to node a at time k − 1 and the minimum distance path to node c at time
k — 1. The comparison is performed by adding the new distance accumulated in
the kth transition by each of these paths to their minimum distances (metrics) at
time k — 1.
It thus appears that the state diagram also represents a system diagram for
this decoder. With each node or state, we associate a storage register which
remembers the minimum-distance path into the state after each transition, as well
as a metric register which remembers its (minimum) distance from the received
sequence. Furthermore, comparisons are made at each step between the two paths
which lead into each node. Thus, one comparator must also be provided for each
state, four in the above example.
Generalization to convolutional codes of any constraint length K and any
rational rate b/n is straightforward. The number of states becomes 2^{b(K−1)}, with
each branch again containing n code symbols. The only modification required for
b > 1 is that due to the fact that 2^b paths now merge at any given level beyond the
(K − 1)st; comparison of distance or metric must be made among 2^b rather than
just two paths, and again only one survivor is preserved. Hence the potential
path population is reduced by a factor 2^{−b} at each merging level, but it then
grows again by the factor 2^b before the next branching level, thus keeping the
number of states constant at 2^{b(K−1)}.
Generalization to arbitrary memoryless channels is almost as immediate.
First, we note that, just as in Sec. 2.9, we may map the branch vectors v_{i1}, v_{i2}, ...,
v_{in} into nonbinary signal vectors x_i (of arbitrary dimension up to n) over an
arbitrary finite alphabet of symbols (for example, amplitudes, phases, etc.). The
memoryless channel (including the demodulator, see Fig. 2.1) then converts these
symbols into noisy output vectors y_i of dimension up to n. The Viterbi decoder⁵
is then based on the metric

Π_{i=1}^B p(y_i | x_{mi})

or equivalently its logarithm

ln Π_{i=1}^B p(y_i | x_{mi}) = Σ_{i=1}^B ln p(y_i | x_{mi})        (4.2.1)

where x_{mi} is the code-subvector of the mth message sequence for the ith branching
level. For the BSC just considered, this reduces to

p(y_i | x_{mi}) = p^{d_{mi}} (1 − p)^{n − d_{mi}}

where d_{mi} is the distance between the n-dimensional received vector and the code-
subvector for the ith branch of the mth path. The logarithm of this metric for a
particular path is

Σ_{i=1}^B ln p(y_i | x_{mi}) = Σ_{i=1}^B [d_{mi} ln (p/(1 − p)) + n ln (1 − p)]        (4.2.2)

Maximizing this metric is equivalent to maximizing

α Σ_{i=1}^B (− d_{mi}) + β

⁵ While the exact terminology is maximum likelihood decoder using the Viterbi algorithm (VA), it
has become common usage to call this simply the Viterbi decoder; we have chosen to adhere to
common usage, with apologies by the first author for this breach of modesty.
where α is a positive constant (for p < 1/2) and β is completely arbitrary. We should
choose paths with maximum metric or, equivalently, we should minimize the
Hamming distance as we have done. For the binary-input constant energy
AWGN channel, on the other hand, we have [see (2.1.15)]

Σ_{i=1}^B ln p(y_i | x_{mi}) = (2/N_0) Σ_{i=1}^B (y_i · x_{mi}) + β
                             = (2/N_0) Σ_{i=1}^B Σ_{j=1}^n y_{ij} x_{mij} + β        (4.2.3)

where y_{ij} is the jth symbol of the ith branch, x_{mij} is the jth symbol of the ith branch
for the mth possible code path, and β is a constant. Maximizing this metric is
equivalent to maximizing the accumulated inner product of the received vector
with the signal vector for each path. Comparisons are made exactly as for the
BSC, except that the survivor in this case corresponds to the maximum inner
product rather than the minimum distance. A similar argument applies to any
memoryless channel, based on the accumulated metric given by (4.2.1).
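The only channel-dependent ingredient is the branch metric. As an illustrative sketch (not from the text), the BSC uses Hamming distance, to be minimized, while the binary-input AWGN channel uses the correlation of (4.2.3), to be maximized; antipodal symbols ±1 are assumed for the AWGN case.

    def bsc_branch_metric(y_branch, x_branch):
        # Hamming distance between n received and n code symbols (minimize)
        return sum(a != b for a, b in zip(y_branch, x_branch))

    def awgn_branch_metric(y_branch, x_branch):
        # Inner product of demodulator outputs with +-1 code symbols (maximize)
        return sum(y * x for y, x in zip(y_branch, x_branch))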
For maximum likelihood decoding of general trellis codes, the Viterbi algor-
ithm proceeds exactly as for convolutional codes. Thus, only the encoder of trellis
codes differs essentially from that of convolutional codes as discussed in Sec. 4.1.
It would of course be desirable to be able to generate code symbols using the
simpler convolutional encoders, if the performance is the same. In Chap. 5, we
shall find that in most applications this is in fact the case.
4.3 DISTANCE PROPERTIES OF CONVOLUTIONAL CODES
FOR BINARY-INPUT CHANNELS
We found in Chap. 2 that the error probability for linear codes and binary-input
channels can be bounded simply by (2.9.19) in terms of the weights of all code
vectors, which correspond to the set of distances from any one code vector to all
others. Error performance of convolutional codes, which constitute a subclass of
linear codes, can similarly be bounded, but with considerably more explicit results
as we shall discover below.
The calculation of the set of code path weights, or equivalently the set of
distances from the all-zeros path to all paths which have diverged from it, is
readily performed with the aid of the code trellis or state diagram. For expository
purposes, we again pursue the example of the K = 3, r = 1/2 code of Fig. 4.2a whose
trellis and state diagram are shown in Figs. 4.4 and 4.5, respectively. We begin by
redrawing the trellis in Fig. 4.8, labeling the branches according to their distances
from the all-zeros path.
Consider now all paths which merge with the all-zeros for the first time at
some arbitrary node j. It is seen from the diagram that, of these paths, there will be
just one path at distance 5 from the all-zeros path, and that this path diverged
from the latter three branches back. Similarly, there are two at distance 6 from the
Figure 4.8 Trellis diagram labeled with distances from the all-zeros path.
all-zeros path, one which diverged four branches back and the other which
diverged five branches back, and so forth. We note also that the input bits for the
distance 5 path are 00 --- 0100, and thus differ in only one input bit from those of
the all-zero symbols path (which of course consists of all input zeros) while the
input bits for the distance 6 paths are 00 --- 001100 and 00 --- 010100, and thus
each differs in 2 input bits from the all-zeros path. The minimum distance,
sometimes called the free distance, among all paths is thus seen to be 5. This
implies that any pair of errors over the BSC can be corrected, for two or fewer
errors will cause the received sequence to be at most distance 2 from the trans-
mitted (correct) sequence but it will be at least at distance 3 from any other
possible code sequence. It appears that with enough patience the distance of all
paths from the all-zeros (or any arbitrary) path can be determined from the trellis
diagram.
However, by examining instead the state diagram, we can readily obtain a
closed-form expression whose expansion yields all distance information directly.
We begin by labeling the branches⁶ of the state diagram of Fig. 4.5 either D², D, or
D⁰ = 1, where the exponent corresponds to the distance of the particular branch
from the corresponding branch of the all-zeros path. Also we split open the node
a = 00, since circulation around this self-loop simply corresponds to branches of
the all-zeros path, whose distance from itself is obviously zero. The result is
Fig. 4.9. Now, as is clear from examination of the trellis diagram, every path which
first remerges with state a = 00 at node level j must have at some previous node
level (possibly the first) originated at this same state a = 00. All such paths can be
traced on the modified state diagram. Adding branch exponents we see that path
a b c a is at distance 5 from the correct path, paths a b d c a and a b c b c a are
both at distance 6, and so forth, for the generating functions of the output sequence
weights of these paths are D^5 and D^6, respectively.
Now we may evaluate the generating function of all paths merging with the
all-zeros at the jth node level simply by summing the generating functions of all
the output sequences of the encoder. This generating function, which can also be
⁶ The parameters D, L, and I in this section are abstract terms.
Figure 4.9 State diagram labeled with distances from the all-zeros path.
regarded as the transfer function of a signal-flow graph with unity input, can most
directly be computed by simultaneous solution of the state equations obtained
from Fig. 4.9
ξ_b = D² + ξ_c
ξ_c = D ξ_b + D ξ_d
ξ_d = D ξ_b + D ξ_d
T(D) = D² ξ_c        (4.3.1)

where ξ_b, ξ_c, and ξ_d are dummy variables for the partial paths to the intermediate
nodes, the input to the a node is unity, and the output is the desired generating
function T(D). Solution of (4.3.1) for T(D) results in

T(D) = D^5/(1 − 2D)
     = D^5 + 2D^6 + 4D^7 + ⋯ + 2^k D^{k+5} + ⋯        (4.3.2)
This verifies our previous observation, and in fact shows that, among the paths
which merge with the all-zeros at a given node, there are 2^k paths at distance k + 5
from the all-zeros path.
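The state equations (4.3.1) are linear once D is fixed numerically, so the closed form can be checked by simple iteration; the following sketch (Python, illustrative) repeatedly substitutes the equations and compares the result with D^5/(1 − 2D).

    def transfer_function(D, iterations=200):
        """Iterate the state equations (4.3.1) for a fixed numerical D."""
        xi_b = xi_c = xi_d = 0.0
        for _ in range(iterations):
            xi_b, xi_c, xi_d = D**2 + xi_c, D*xi_b + D*xi_d, D*xi_b + D*xi_d
        return D**2 * xi_c                       # T(D) = D^2 * xi_c

    D = 0.1
    print(transfer_function(D), D**5 / (1 - 2*D))   # both approximately 1.25e-5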
Of course, (4.3.2) holds for an infinitely long code sequence; if we are dealing
with the jth node level, we must truncate the series at some point. This is most
easily done by considering the additional information indicated in the modified
state diagram of Fig. 4.10. The L terms will be used to determine the length of a
given path; since each branch has an L, the exponent of the L factor will be
augmented by one every time a branch is passed through. The I term is included
only if that branch transition was caused by an input data “ 1,” corresponding to a
dotted branch in the trellis diagram. Rewriting the state equations (4.3.1), includ-
ing now the factors in I and L shown in Fig. 4.10, and solving for the augmented
Figure 4.10 State diagram labeled
with distance, length, and number
of input “1”s.
generating function yields

T(D, L, I) = D^5 L^3 I / [1 − DL(1 + L)I]
           = D^5 L^3 I + D^6 L^4 (1 + L) I^2 + D^7 L^5 (1 + L)^2 I^3 + ⋯
             + D^{k+5} L^{k+3} (1 + L)^k I^{k+1} + ⋯        (4.3.3)
Thus we have verified that of the two distance 6 paths, one is of length 4 and the
other is of length 5, and both differ in two input bits from the all-zeros. Thus, for
example, if the all-zeros was the correct path and the noise causes us to choose one
of these incorrect paths, two bit errors will be made. Also, of the distance 7 paths,
one is of length 5, two are of length 6, and one is of length 7; all four paths
correspond to input sequences with three “1”s. If we are interested in the jth node
level, clearly we should truncate the series such that no terms with powers of L greater
than L^j are included.
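The terms of (4.3.3) can also be generated by direct enumeration: walk every path that leaves the all-zeros state and record its distance, length, and number of input ones when it first remerges. The sketch below (Python, illustrative; it reuses the hypothetical state_tables() helper and assumed taps from earlier) reproduces the exponents of D, L, and I for the shortest paths.

    def enumerate_paths(generators, max_len=7):
        """List (distance, length in branches, input ones) for every path that
        diverges from state 0 and first remerges within max_len branches."""
        ns, out = state_tables(generators)
        results = []
        def walk(state, dist, length, ones):
            if length >= max_len:
                return
            for u in (0, 1):
                d = dist + sum(out[(state, u)])
                t = ns[(state, u)]
                if t == 0:
                    if u == 0 and length >= 1:      # first return to the all-zeros state
                        results.append((d, length + 1, ones))
                else:
                    walk(t, d, length + 1, ones + u)
        walk(0, 0, 0, 0)
        return sorted(results)

    print(enumerate_paths([[1, 1, 1], [1, 0, 1]])[:4])
    # -> [(5, 3, 1), (6, 4, 2), (6, 5, 2), (7, 5, 3)],
    #    matching the leading terms D^5 L^3 I + D^6 L^4 (1 + L) I^2 + ...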
We have thus fully determined the properties of all code paths of this simple
convolutional code. The same techniques can obviously be applied to any binary-
symbol code of arbitrary constraint length and arbitrary rate b/n. However, for
b > 1, each state equation of the type of (4.3.1) is a relationship among at most
2^b + 1 node variables. In general, there will be 2^{b(K−1)} state variables and as many
equations. (For further examples, see Probs. 4.6, 4.17, and 4.18.) In the next two
sections we shall demonstrate how the generating function can be used to bound
directly the error probability of a Viterbi decoder operating on any convolutional
code on a binary-input, memoryless channel.
4.4 PERFORMANCE BOUNDS FOR SPECIFIC
CONVOLUTIONAL CODES ON BINARY-INPUT, OUTPUT-
SYMMETRIC MEMORYLESS CHANNELS
It should be reasonably evident at this point that the block length nB of a
convolutional code is essentially irrelevant, for both the encoder and decoder
complexity and operation depend only on the constraint length K, the code rate,
Figure 4.11 Example of error events.
and channel parameters; furthermore, the performance is a function of relative
distances among signals, which may be determined from the code state diagram,
whose structure and complexity depends strongly on the constraint length but not
at all on the block length. Thus it would appear that block error probability is not
a reasonable performance measure, particularly when, as is often the case, an
entire message is convolutionally encoded as a single block, whereas in block
coding the same message would be encoded into many smaller blocks. Ultimately,
the most useful measure is bit error probability P_b, which, as initially defined in
Sec. 2.11, is the expected number of bit errors in a given sequence of received bits
normalized by the total number of bits in the sequence.
While our ultimate goal is to upper-bound P_b, we consider initially a more
readily determined performance measure, the error probability per node, which we
denote P_e. In Fig. 4.11 we show (as solid lines) two paths through the code trellis.
Without loss of essential generality, we take the upper all-zeros path to be correct,
and the lower path to be that chosen by the maximum likelihood decoder. For this
to occur, the correct path metric increments over the unmerged segments must be
lower than those of the incorrect (lower solid line) path shown. We shall refer to
these error events as node errors at nodes i, j, and k. On the other hand, the dotted
paths which diverge from the correct path at nodes j’ and k’ may also have higher
metric increments than the correct path over the unmerged segments, and yet not
be ultimately selected because their accumulated metrics are smaller than those of
the lower solid paths. We may conclude from this exposition that a necessary, but
not sufficient, condition for a node error to occur at node j is that the metric of an
incorrect path diverging from the correct path at this node accumulates higher
metric increments than the correct path over the unmerged segment.
We may therefore upper-bound the probability of node error at node j by the
probability that any path diverging from the correct path at node j accumulates
higher total metric over the unmerged span of the path.
P_e(j) ≤ Pr { ∪_{x'_j ∈ X'(j)} [ΔM(x'_j, x_j) ≥ 0] }        (4.4.1)

where x'_j is an incorrect path diverging from the correct path at node j, X'(j) is the
set of all such paths, known as the incorrect subset for node j, and ΔM(x'_j, x_j) is the
difference between the metric increment of this path and of the correct path x_j
over the unmerged segment.
Employing the union bound, we obtain the more convenient, although looser,
form

P_e(j) ≤ Σ_{x'_j ∈ X'(j)} Pr { ΔM(x'_j, x_j) ≥ 0 }        (4.4.2)
But each term of this summation is the pairwise error probability for two code
vectors over the unmerged segment. For a binary-input channel, this is readily
bounded as a function of the distance between code vectors over this segment. For,
if the total Hamming distance between code vectors x'_j and x_j (over their un-
merged segment) is d(x'_j, x_j) = d, we have from (2.9.19) that, for an output-
symmetric channel, the pairwise error probability is bounded by the
Bhattacharyya bound

P_d ≤ exp { d ln Σ_y √(p_0(y) p_1(y)) }        (4.4.3)

where p_i(y) is the conditional (channel transition) probability of output y given
that the input symbol was i (i = 0, 1). Equivalently, we may express this bound in
the more convenient form

P_d ≤ Z^d        (4.4.4)

where

Z = Σ_y √(p_0(y) p_1(y))        (3.4.12)
Thus given that there are a(d) incorrect paths which are at Hamming distance d
from the correct path over the unmerged segment, we obtain from (4.4.1) through
(4.4.4)
P_e(j) ≤ Σ_{d=d_f}^∞ Pr { error caused by any one of a(d) incorrect paths at distance d }
       ≤ Σ_{d=d_f}^∞ a(d) Z^d        (4.4.5)

where d_f is the minimum distance of any path from the correct path, which we
called the free distance in the last section. Clearly (4.4.5) is a union-Bhattacharyya
bound similar to those derived for block codes in Chap. 2.
We also found in the last section that the set of all distances from any one path
to all other paths could be found from the generating function T(D). For demon-
stration purposes, let us consider again the code example of Figs. 4.2a, 4.4, and 4.5.
We found then that

T(D) = D^5 + 2D^6 + 4D^7 + ⋯ + 2^{d−5} D^d + ⋯
     = Σ_{d=5}^∞ 2^{d−5} D^d

Thus in this case d_f = 5 and a(d) = 2^{d−5}. The same argument can be applied to
any binary code whose generating function we can determine by the techniques of
the last section. Thus we have in general that

T(D) = Σ_{d=d_f}^∞ a(d) D^d        (4.4.6)

and it then follows from (4.4.5) and (4.4.6) that

P_e(j) ≤ T(D) |_{D=Z}        (4.4.7)
We note also that this node error probability bound for a fixed convolutional code
is the same for all nodes when B = ∞ and that this is also an upper bound for
finite B.
Turning now to the bit error probability, we note that the expected number of
bit errors, caused by any incorrect path which diverges from the correct path at
node j, can be bounded by weighting each term of the union bound by the number
of bit errors which occur on that incorrect path. Taking the all-zeros data path to
be the correct path (without loss of generality on output-symmetric channels), this
then corresponds to the number of “1”s in the data sequence over the unmerged
segment. Thus the bound on the expected number of bit errors caused by an
incorrect path diverging at node j is
E[n_b(j)] ≤ Σ_{d=d_f}^∞ Σ_{i=1}^∞ i a(d, i) P_d ≤ Σ_{d=d_f}^∞ Σ_{i=1}^∞ i a(d, i) Z^d        (4.4.8)
where a(d, i) is the number of paths diverging from the all-zeros path (at node j) at
distance d and with i “1”s in its data sequence over the unmerged segment. But
the coefficients a(d, i) are also the coefficients of the augmented generating func-
tion T(D, I) derived in the last section. For the running example, we have from
(4.3.3) (with L = 1 since we are not interested in path lengths)
T(D, I) = D^5 I / (1 − 2DI) = D^5 I + 2D^6 I^2 + 4D^7 I^3 + ⋯
        = Σ_{d=5}^∞ 2^{d−5} D^d I^{d−4}

and hence

a(d, i) = 2^{d−5}   for i = d − 4, d ≥ 5
        = 0         otherwise

In this case then,

E[n_b(j)] ≤ Σ_{d=5}^∞ (d − 4) 2^{d−5} Z^d = ∂T(D, I)/∂I |_{I=1, D=Z}
In general it should be clear that the augmented generating function can be
expanded in the form
T(D, I) = Σ_{i=1}^∞ Σ_{d=d_f}^∞ a(d, i) D^d I^i        (4.4.9)

whose derivative at I = 1 is

∂T(D, I)/∂I |_{I=1} = Σ_{i=1}^∞ Σ_{d=d_f}^∞ i a(d, i) D^d        (4.4.10)

Consequently, comparing (4.4.8) and (4.4.10), we have

E[n_b(j)] ≤ ∂T(D, I)/∂I |_{I=1, D=Z}        (4.4.11)
This is an upper bound on the expected number of bit errors caused by an
incorrect path diverging at any node j.
For a rate 1/n code, each node (branch) represents one bit of information into
the encoder or decoder. Thus the bit error probability defined as the expected
number of bit errors per bit decoded is bounded by
P_b(j) = E[n_b(j)] ≤ ∂T(D, I)/∂I |_{I=1, D=Z}        (4.4.12)

as shown in (4.4.11). For a rate b/n code, one branch corresponds to b information
bits. Thus in general

P_b ≤ (1/b) ∂T(D, I)/∂I |_{I=1, D=Z}        (4.4.13)
where Z is given by (3.4.12).
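For the running K = 3, rate 1/2 example, the derivative needed in (4.4.12) can be taken in closed form: differentiating T(D, I) = D^5 I/(1 − 2DI) gives ∂T/∂I |_{I=1} = D^5/(1 − 2D)^2. The short sketch below (Python, illustrative numbers only) evaluates the resulting bound on a BSC, using Z = 2√(p(1 − p)) from (3.4.12).

    from math import sqrt

    def pb_bound_bsc(p):
        """Union-Bhattacharyya bit-error bound (4.4.12), K = 3, rate 1/2 example."""
        Z = 2 * sqrt(p * (1 - p))       # Bhattacharyya parameter of the BSC
        return Z**5 / (1 - 2 * Z)**2    # dT(D, I)/dI at I = 1, D = Z

    print(pb_bound_bsc(0.01))           # roughly 8.6e-4 for p = 0.01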
4.5 SPECIAL CASES AND EXAMPLES
It is somewhat instructive to consider the BSC and the binary-input AWGN
channel, special cases of the channels considered in the last section. Clearly the
union-Bhattacharyya bounds apply with [see (2.11.6) and (2.11.7) and (3.4.15)
and (3.4.17)]
(Z)_BSC = √(4p(1 − p))        (4.5.1)

(Z)_AWGN = e^{−E_s/N_0}        (4.5.2)

We note also, as was already observed in Sec. 2.11, that if the AWGN channel is
converted to the BSC by hard quantization for E_s/N_0 ≪ 1, then

p ≈ 1/2 − √(E_s/(π N_0))
in which case

−ln Z = −ln √(1 − 4E_s/(π N_0)) ≈ (2/π)(E_s/N_0)

for a loss of 2/π, or approximately 2 dB, in energy-to-noise ratio.
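As a quick arithmetic check of this figure, the energy penalty factor is π/2 ≈ 1.571, and 10 log₁₀(π/2) ≈ 1.96 dB, consistent with the approximately 2 dB quoted above.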
However, for these two special channels, tighter bounds can be found by
obtaining the exact pairwise error probabilities rather than their Bhattacharyya
bounds. For the BSC, we recall from (2.10.14) that, for unmerged segments at
distance d from the correct path⁷

P_d = Σ_{k=(d+1)/2}^{d} (d choose k) p^k (1 − p)^{d−k}                                                  d odd

P_d = (1/2)(d choose d/2) p^{d/2} (1 − p)^{d/2} + Σ_{k=d/2+1}^{d} (d choose k) p^k (1 − p)^{d−k}        d even        (4.5.3)
This can be used in the middle expressions of inequalities (4.4.5) and (4.4.8) to
obtain tighter results than (4.4.7) and (4.4.12) (see also Prob. 4.10).
Similarly, for the binary-input AWGN channel, we have from (2.3.10) that the
pairwise error probability for code vectors at distance d is
P_d = Q(√(2dE_s/N_0))        (4.5.4)

While we may substitute this in the above expressions in place of Z^d = e^{−dE_s/N_0}, a
more elegant and useful expression results from noting that (Prob. 4.8)

Q(√(x + y)) ≤ Q(√x) e^{−y/2}        x > 0, y > 0        (4.5.5)

Since d ≥ d_f, we may bound (4.5.4) by

P_d ≤ Q(√(2d_f E_s/N_0)) e^{−(d − d_f)E_s/N_0}        (4.5.6)
which is tighter than the Bhattacharyya bound. Substituting in the middle terms
of (4.4.5) and (4.4.8), then using (4.4.6) and (4.4.10), we obtain
P_e(j) ≤ Q(√(2d_f E_s/N_0)) e^{d_f E_s/N_0} Σ_{d=d_f}^∞ a(d) e^{−dE_s/N_0}
       = Q(√(2d_f E_s/N_0)) e^{d_f E_s/N_0} T(D) |_{D=e^{−E_s/N_0}}        (4.5.7)
⁷ Ties are assumed to be randomly resolved. Note that unlike the block code case for which
(2.10.14) holds, all probabilities here are for pairwise errors.
and

P_b ≤ (1/b) Q(√(2d_f E_s/N_0)) e^{d_f E_s/N_0} Σ_{i=1}^∞ Σ_{d=d_f}^∞ i a(d, i) D^d |_{D=e^{−E_s/N_0}}
    = (1/b) Q(√(2d_f E_s/N_0)) e^{d_f E_s/N_0} ∂T(D, I)/∂I |_{I=1, D=e^{−E_s/N_0}}        (4.5.8)
The last bound has been used very effectively to obtain tight upper bounds for
the bit error probability on the binary-input AWGN channel for a variety of
convolutional codes of constraint lengths less than 10. For, while the computation
of T(D, I) for a constraint length K code would appear to involve the analytical
solution of 2°*~ " simultaneous algebraic equations (Sec. 4.3), the computation of
T(D, I) for fixed values of D = Z and I becomes merely a numerical matrix
inversion. Also since T(D, I) is a polynomial in J with nonnegative coefficients and
has a nondecreasing first derivative for positive arguments, the derivative at J = 1
can be upper-bounded numerically by computing instead the normalized first
difference. Thus
CT(D, 1) 2 T(Z, 1 + €) — T(Z, 1)
3 Daag gee €
e<l (4.5.9)
Even the numerical matrix inversion involved in calculating T(D, I) for fixed D
and I is greatly simplified by the fact that the diagonal terms of the state equations
matrix [see (4.3.1) and Probs. 4.17 and 4.18] dominate all other terms in the same
row. As a result, the inverse can be computed as a rapidly convergent series of
powers of the given matrix (see Prob. 4.18). The results for optimum rate 1/2 codes⁸
of constraint length 3 through 8 are shown in Fig. 4.12. To assess the tightness of
these bounds we show also in the figure the results of simulations of the same
codes, but with output quantization to eight levels. For the low error probability
region (E_b/N_0 > 5 dB), it appears that the upper bounds lie slightly below the
simulation. The simulations should, in fact, lie above the exact curve because the
quantization loss is on the order of 0.25 dB (see Sec. 2.8). This, in fact, appears to
be the approximate separation between simulation and upper bounds, attesting to
the accuracy of the bounds.
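The numerical procedure just outlined is easily sketched. The fragment below (Python, illustrative; it uses a simple repeated-substitution solution of the augmented state equations for the K = 3, rate 1/2 example rather than an explicit matrix inversion, together with the finite difference of (4.5.9)) evaluates the bound (4.5.8) on the unquantized AWGN channel.

    from math import exp, sqrt, erfc

    def T_augmented(D, I, iterations=500):
        """T(D, I) for the K = 3, rate 1/2 example, by iterating its state equations."""
        xi_b = xi_c = xi_d = 0.0
        for _ in range(iterations):
            xi_b, xi_c, xi_d = D*D*I + I*xi_c, D*xi_b + D*xi_d, D*I*xi_b + D*I*xi_d
        return D*D*xi_c

    def pb_bound_awgn(Es_over_N0, df=5, eps=1e-4):
        Q = lambda x: 0.5 * erfc(x / sqrt(2.0))
        Z = exp(-Es_over_N0)
        dT_dI = (T_augmented(Z, 1 + eps) - T_augmented(Z, 1)) / eps   # first difference (4.5.9)
        return Q(sqrt(2 * df * Es_over_N0)) * exp(df * Es_over_N0) * dT_dI   # (4.5.8), b = 1

    print(pb_bound_awgn(2.5))    # E_s/N_0 = 2.5, i.e., E_b/N_0 = 5.0 for a rate 1/2 code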
In all codes considered thus far, the generating function sequence
T(D) = Σ_{d=d_f}^∞ a(d) D^d        (4.5.10)
was assumed to converge for any value of D less than unity. That this will not
always be true is demonstrated by the example of Fig. 4.13. For this code, the self
⁸ The codes were selected on the basis of maximum free distance and minimum number of bit
errors caused by incorrect paths at the free distance, i.e., minimum a(d_f, i) (Odenwalder [1970]).
Figure 4.12 P_b as a function of E_b/N_0 for Viterbi decoding of rate 1/2 codes, K = 3 through 8 [(a) K = 3, 5, 7; (b) K = 4, 6, 8]: simulations with eight-level quantization and 32-bit path memory (solid); upper bounds for unquantized AWGN (dotted). (Courtesy of Heller and Jacobs [1971].)
Figure 4.13 Encoder displaying catastrophic error propagation and its state diagram.
loop at state d does not increase distance, so that the path abddd ... ddca will be at
distance 6 from the correct path no matter how many times it circulates about this
self-loop. Thus it is possible on a BSC, for example, for a fixed finite number of
channel errors to cause an arbitrarily large number of decoded bit errors. To
illustrate in this case, for example, if the correct path is the all-zeros and the BSC
produces two errors in the first branch, no errors in the next B branches and two
errors in the (B + 1)st branch, B − 1 decoded bit errors will occur for an arbi-
trarily large B. For obvious reasons, such a code, for which a finite number of
channel errors (or noise) can cause an infinite number of decoded bit errors, is
called catastrophic.
It is clear from the above example that a convolutional code is catastrophic
if and only if, for some directed closed loop in the state diagram, all branches
have zero weight; that is, the closed loop path generating function is D^0 = 1. An even
more useful method to ensure the avoidance of a catastrophic code is to establish
necessary and sufficient conditions in terms of the code-generator sequences g_k.
For rate 1/n codes, Massey and Sain [1968] have obtained such conditions in terms
of the code generator polynomials which are defined in terms of the generator
sequences as⁹

g_k(z) = 1 + g_{1k} z + g_{2k} z^2 + ⋯ + g_{(K−1)k} z^{K−1}        k = 1, 2, ..., n
In terms of these polynomials, the theorem of Massey and Sain (see Prob. 4.11)
states that a fixed convolutional code is catastrophic if and only if all generator
polynomials have a common polynomial factor (of degree at least one). Also of
interest is the question of the relative fraction of catastrophic codes in the en-
semble of all convolutional codes of a given rate and constraint length. Forney
[1970] and Rosenberg [1971] have shown that, for a rate 1/n code, this fraction is
1/(2^n − 1), independent of constraint length (see Prob. 4.12). Hence generally, the
search for a good code is not seriously encumbered by the catastrophic codes,
which are relatively sparse and easy to distinguish.
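The Massey-Sain condition reduces to a greatest-common-divisor computation over GF(2). The sketch below (Python, illustrative; polynomials are packed into integers with bit i holding the coefficient of z^i, and the example generator pairs are assumptions) flags a code as catastrophic when its generator polynomials share a factor of degree at least one.

    from functools import reduce

    def gf2_mod(a, b):
        while a and a.bit_length() >= b.bit_length():
            a ^= b << (a.bit_length() - b.bit_length())
        return a

    def gf2_gcd(a, b):
        while b:
            a, b = b, gf2_mod(a, b)
        return a

    def is_catastrophic(generator_polys):
        g = reduce(gf2_gcd, generator_polys)
        return g.bit_length() > 1            # common factor of degree >= 1

    print(is_catastrophic([0b111, 0b101]))   # 1+z+z^2 and 1+z^2: gcd = 1, not catastrophic
    print(is_catastrophic([0b011, 0b101]))   # 1+z and 1+z^2 = (1+z)^2: catastrophic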
One subclass of convolutional codes that are not catastrophic is that of the
systematic convolutional codes. As with systematic block codes, systematic convo-
⁹ In this context, z is taken to be an abstract variable, not a real number. The lowest order
coefficient can always be taken as one without loss of optimality or essential generality.
Table 4.1 Maximum free distance of noncatastrophic codes

Rate r = 1/2

        Systematic†    Nonsystematic
K       d_f            d_f
2       3              3
3       4              5
4       4              6
5       5              7
6       6              8
7       6              10
8       7              10

Rate r = 1/3

        Systematic†    Nonsystematic
K       d_f            d_f
2       5              5
3       6              8
4       8              10
5       9              12
6       10             13
7       12             15
8       12             16

† With feed-forward logic.
lutional codes have the property that the data symbols are transmitted unchanged
among the coded symbols. For a systematic rate b/n convolutional code, in each
branch the first b symbols are data symbols followed by n — b parity or coded
symbols. The coded symbols are generated just as for nonsystematic codes, and
consequently depend on the last Kb data symbols where Kb is the constraint
length. Since data symbols appear directly on each branch in the state or trellis
diagram, for systematic convolutional codes it is impossible to have a self-loop in
which distance to the all-zeros path does not increase, and therefore these codes
are not catastrophic.
In Sec. 5.7, we show that systematic feed-forward convolutional codes do not
perform as well as nonsystematic convolutional codes.¹⁰ There we show that, for
asymptotically large K, the performance of a systematic code of constraint length
K is approximately the same as that of a nonsystematic code of constraint length
K(1 − r) where r = b/n. Thus for rate r = 1/2 and very large K, systematic codes
have about the performance of nonsystematic codes of half the constraint length,
while requiring exactly the same optimal decoder complexity.
Another indication of the relative weakness of systematic convolutional codes
is shown in the free distance, d_f, which is the exponent of D in the leading term of
the generating function T(D). Table 4.1 shows the maximum free distance achiev-
able with binary feed-forward systematic codes and nonsystematic codes that are
not catastrophic. We show this for various constraint lengths K and rates r. As
indicated by the results of Sec. 5.7, for large K the differences are even greater.
4.6 STRUCTURE OF RATE 1/n CODES AND ORTHOGONAL
CONVOLUTIONAL CODES
While the weight or distance properties of the paths of a convolutional code
naturally depend on the encoder generator sequences, both the unmerged path
lengths and the number of “1”s in the data sequence for a particular code path
are functions only of the constraint length, K, and rate numerator, b. Thus for
example, for any rate 1/n, constraint length 3 code [see (4.3.3)]
T_3(L, I) = L^3 I / [1 − L(1 + L)I]        (4.6.1)
To obtain a general formula for the generating function T,(L, I) of any rate 1/n
code of constraint length K, we may proceed as follows. Consider the state just
prior to the terminal state in the state diagram of a constraint length K code (see
Fig. 4.10 for K = 3). The (K − 1)-dimensional vector for this state is 10 ... 0.
Suppose this were the terminal state and that when a path reached this state it was
considered absorbed (or remerged) without the possibility to go on to either of the
¹⁰ It can be shown (Forney [1970]) that for any nonsystematic convolutional code, there is an
equivalent systematic code in which the parity symbols are generated with linear feedback logic.
states 0 ...0 or 0... 01. Then the initial input into the encoder register could be
ignored, and we would have a code of constraint length K — 1. It follows that the
generating function of all paths from the origin to this next-to-terminal state, must
be Tx -,(L, I). Now, if an additional “0” enters when the encoder is in this state,
the terminal state is reached. If, on the other hand, a “1” enters we are effectively
back to the situation of initial entry into the state diagram; that is, the “1” takes
us to state 00 ... 1 with the branch from 10 ... 0 playing the same role as that from
the initial state. This implies that the recursion formula for the generating function
Tx(L, I) is
T_K(L, I) = L T_{K−1}(L, I) + T_{K−1}(L, I) T_K(L, I)        K ≥ 2        (4.6.2)
In words, to arrive at the terminal state, since we must first pass through the state
100...0, the first term on the right corresponds to a “0” entering when the
encoder is in this state, in which case the terminal state is reached with an addition
of one branch length (with data zero); the second term on the right corresponds
to an input “1,” in which case we may treat the state 100 ... 0 as if it were the
initial state and the terminal state can only be reached by following one of
the paths of T_K(L, I). From (4.6.2), we immediately obtain
T_K(L, I) = L T_{K−1}(L, I) / [1 − T_{K−1}(L, I)]        K ≥ 2        (4.6.3)
Trivially, for K = 1
T_1(L, I) = LI        (4.6.4)
Then the solution of (4.6.3) is obtained by induction as

T_K(L, I) = I L^K / [1 − IL(1 + L + L^2 + ⋯ + L^{K−2})]
          = I L^K (1 − L) / [1 − L(1 + I(1 − L^{K−1}))]        K ≥ 2        (4.6.5)

If only the path length structure is of interest, we may restrict attention to

T_K(L) = T_K(L, 1) = L^K (1 − L) / (1 − 2L + L^K)        (4.6.6)
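The recursion (4.6.3) and the closed form (4.6.6) can be checked against each other numerically; the sketch below (Python, illustrative) does so with exact rational arithmetic for I = 1.

    from fractions import Fraction

    def T_recursive(K, L):
        T = L                            # T_1(L, 1) = L
        for _ in range(2, K + 1):
            T = L * T / (1 - T)          # recursion (4.6.3) with I = 1
        return T

    def T_closed(K, L):
        return L**K * (1 - L) / (1 - 2*L + L**K)    # closed form (4.6.6)

    L = Fraction(1, 4)
    print(T_recursive(5, L), T_closed(5, L))        # identical rationals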
We shall utilize these results in the next chapter when we treat convolutional
code ensembles. We conclude this discussion by considering a class of codes
whose distance or weight properties are the same for all branches, and con-
sequently whose performance depends only on the path structure. Such a class of
codes is the orthogonal, rate 2^{−K} convolutional codes generated by the encoder of
Fig. 4.14. The block orthogonal encoder generates one of 2^K orthogonal binary
sequences of dimension n = 2^K (as described in Sec. 2.5). Hence the weight of any
branch not on the all-zeros data path is exactly 2^{K−1} = n/2. Thus for this class of
Figure 4.14 Convolutional orthogonal encoder (K-input block orthogonal encoder, n = 2^K outputs): constraint length K; rate r = 2^{−K}.
codes, since each branch has weight n/2, T_K(D, I) is obtained from T_K(L, I) by
replacing L by D^{n/2} everywhere, and thus

T_K(D, I) = I D^{Kn/2} (1 − D^{n/2}) / [1 − D^{n/2}(1 + I(1 − D^{(K−1)n/2}))]        (4.6.7)

T_K(D) = D^{Kn/2} (1 − D^{n/2}) / (1 − 2D^{n/2} + D^{Kn/2})        (4.6.8)
Consequently, employing (4.4.7) and (4.4.12), we obtain the node and bit error
probabilities for the AWGN channel, for which Z = e^{−E_s/N_0}, as

P_e ≤ T_K(D) |_{D=Z} = Z^{Kn/2}(1 − Z^{n/2}) / (1 − 2Z^{n/2} + Z^{Kn/2}) < Z^{Kn/2} / (1 − 2Z^{n/2})        (4.6.9)

P_b ≤ ∂T_K(D, I)/∂I |_{I=1, D=Z} < Z^{Kn/2} / (1 − 2Z^{n/2})^2        (4.6.10)

Recognizing that, since for a rate 1/n code there are n code symbols/bit,

Z^n = e^{−nE_s/N_0} = e^{−E_b/N_0}

we find

P_e < e^{−KE_b/2N_0} / (1 − 2e^{−E_b/2N_0})

P_b < e^{−KE_b/2N_0} / (1 − 2e^{−E_b/2N_0})^2        E_b/N_0 > 2 ln 2        (4.6.11)

We recall also from Sec. 2.5 [Eqs. (2.5.13) and (2.5.18)] that

E_b/(N_0 ln 2) = C_T/R_T        (4.6.12)
where R_T, the transmission rate, and C_T, the capacity, are in nats per second and
the transmission time per bit is

T_b = ln 2/R_T s/bit
Thus (4.6.11) becomes

P_e < 2^{−KC_T/(2R_T)} / (1 − 2^{1−C_T/(2R_T)})        (4.6.13)

P_b < 2^{−KC_T/(2R_T)} / (1 − 2^{1−C_T/(2R_T)})^2        R_T/C_T < 1/2        (4.6.14)
For orthogonal block codes we were able to show in Sec. 2.5 that the error
probability decreases exponentially with block length for all R; < C,. We recall,
however, that to obtain that result we employed a more refined bounding
technique than the union bound; we now use a similar approach for convolutional
codes to demonstrate an exponential bound in terms of constraint length for all
rates up to capacity.
We begin with node error probability, and recall that an error can occur at
node j only if an incorrect path diverging at this node from the correct path has
higher metric upon remerging. From the generating function T,(L) of (4.6.6), we
can determine all unmerged path lengths, and from this the totality of diverging
paths which remerge again a number of branches ahead. This formula can be
somewhat simplified, if we bound (4.6.6), in the sense of counting for every L
more paths than actually exist, as follows
T_K(L) ≤ L^K/(1 − 2L) = Σ_{k=0}^∞ 2^k L^{K+k}        (4.6.15)
Thus, of the totality of paths diverging at a given node, there are no more than 2^k
incorrect paths which merge after K + k branches; as we shall find, this overesti-
mate of number of paths (by approximately double) has negligible asymptotic
effect. Now for an orthogonal convolutional code, all paths which are unmerged
from the correct path for K + k branches have code vectors which are orthogonal
to it over this entire unmerged segment. The node error probability can be
bounded by
P_e ≤ Σ_{k=0}^∞ Π_k        (4.6.16)

where Π_k = Pr { error caused by any one of no more than 2^k incorrect paths unmerged over K + k branches }
This, of course, is a union bound, but rather than summing over all individual
path error events, as in Sec. 4.4, we treat as a single event all errors caused by
paths unmerged for the same number of branches. Now instead of bounding the
probability of these events by a union bound over their members, as was done
before, we employ a Gallager bound. In fact, we may apply precisely the deriva-
tion of Sec. 2.5, based on the Gallager bound (2.4.8), to the set of up to 2^k incorrect
paths, unmerged with, and hence orthogonal to, the correct path over K +k
branches. Noting that this argument does not require all code vectors to be
mutually orthogonal, but only that each incorrect code vector be pairwise ortho-
gonal to the correct code vector, we thus have from (2.5.12), for orthogonal codes
on the AWGN channel
Π_k ≤ 2^{kρ} exp { −(K + k)(E_b/N_0) ρ/(1 + ρ) }        0 ≤ ρ ≤ 1        (4.6.17)
since the energy over this segment is (K + k) times the energy per branch, which
equals E_b for a rate 1/n code. Substituting (4.6.17) into (4.6.16), then using (4.6.12)
yields

P_e ≤ exp { −K(E_b/N_0) ρ/(1 + ρ) } Σ_{k=0}^∞ exp { −kρ [(E_b/N_0)/(1 + ρ) − ln 2] }
    = 2^{−K(C_T/R_T)ρ/(1+ρ)} / [1 − 2^{−ρ[(C_T/R_T)/(1+ρ) − 1]}]        0 ≤ ρ ≤ 1        (4.6.18)
Clearly if we take ρ = 1, we obtain the bound (4.6.13). On the other hand, taking
for some 0 < ε < 1

ρ = (C_T/R_T)(1 − ε) − 1        (4.6.19)

we have

P_e ≤ 2^{−K[(C_T/R_T) − 1/(1−ε)]} / (1 − 2^{−ε[(C_T/R_T) − 1/(1−ε)]})        C_T/2 ≤ R_T ≤ C_T(1 − ε)        (4.6.20)
Thus we have an exponential decrease in K for all R_T < C_T(1 − ε). For asympto-
tically large K, the denominator becomes insignificant so that we can let ε → 0.
To bound the bit error probability requires little more effort if we recognize
that a node error due to an incorrect path which has been unmerged over K + k
branches can cause at most k + 1 bit errors; for in order to merge with the correct
path, the last K — 1 data bits for the incorrect path must coincide with those of the
correct path. It follows then that the bit error probability is bounded by a summa-
tion of the form of (4.6.16) with each term weighted by k + 1. Thus
P_b ≤ Σ_{k=0}^∞ (k + 1) Π_k

Now using (4.6.17), and recognizing that

Σ_{k=0}^∞ (k + 1) x^k = 1/(1 − x)^2        x < 1
we have
P_b ≤ exp { −K(E_b/N_0) ρ/(1 + ρ) } Σ_{k=0}^∞ (k + 1) exp { −kρ [(E_b/N_0)/(1 + ρ) − ln 2] }
    = 2^{−K(C_T/R_T)ρ/(1+ρ)} / (1 − 2^{−ρ[(C_T/R_T)/(1+ρ) − 1]})^2        (4.6.21)
Finally, applying (4.6.19) we obtain, analogously to (4.6.20)
P_b ≤ 2^{−K[(C_T/R_T) − 1/(1−ε)]} / (1 − 2^{−ε[(C_T/R_T) − 1/(1−ε)]})^2        C_T/2 ≤ R_T ≤ C_T(1 − ε)        (4.6.22)
Combining (4.6.13), (4.6.14), (4.6.20), and (4.6.22), we obtain
P_e ≤ 2^{−K E_c(R_T)/R_T} / δ(R_T)
                                                  0 < R_T < C_T(1 − ε),  T_b = ln 2/R_T        (4.6.23)
P_b ≤ 2^{−K E_c(R_T)/R_T} / δ^2(R_T)

where δ(R_T) > 0 for ε > 0 and

E_c(R_T) = C_T/2                       0 ≤ R_T ≤ C_T/2
         = C_T − R_T/(1 − ε)           C_T/2 ≤ R_T ≤ C_T(1 − ε)        (4.6.24)
Figure 4.15 compares E_c(R_T), the convolutional exponent as ε → 0, with the block
exponent E(R_T) of (2.5.16) as a function of R_T/C_T. Comparing the latter with
(4.6.23), we note that T for block codes is the time to transmit K bits, as is KT_b for
convolutional codes; thus (2.5.16) can be expressed as

P_E ≤ 2^{−KE(R_T)/R_T}

with E(R_T) as defined there. Hence, as is seen from the comparison of E_c(R_T) and
E(R_T) in Fig. 4.15, the convolutional coding exponent clearly dominates the block
coding exponent for orthogonal codes.
Figure 4.15 Limiting form of E_c(R_T) for orthogonal convolutional codes and comparison with orthogonal block codes.
Comparing decoding complexity, a maximum likelihood block decoder per-
forms 2^K comparisons every K bits or 2^K/K comparisons per bit while a Viterbi
maximum likelihood convolutional decoder performs 2^{K−1} comparisons per bit;
the difference becomes insignificant for large K. On the other hand, the bandwidth
expansion of block codes is proportional to 2^K/K while it is proportional to 2^{K−1}
for convolutional codes, a severe drawback; this, however, is a feature only of
orthogonal codes, and in the next chapter we shall show by ensemble arguments
that, for the same bandwidth expansion (or code rate), the convolutional code
exponent dominates the block exponent in a similar manner for all memoryless
channels.
4.7 PATH MEMORY TRUNCATION, METRIC
QUANTIZATION, AND CODE SYNCHRONIZATION
IN VITERBI DECODERS
In deriving the Viterbi algorithm for maximum likelihood decoding of convolu-
tional codes in Sec. 4.2, we made three impractical assumptions which are, in
order of importance, arbitrarily long path memories, arbitrarily accurate metrics,
and perfect code synchronization. All three of these requirements can be eli-
minated with a minimal loss of performance, as we shall now discuss.
As initially described, the algorithm requires that a final decision on the most
probable code path be deferred until the end of the code block, or message, when
the trellis merges into the single state 0 by insertion of a b(K — 1) zero “tail” into
the coder register. Thus if the message or block length is (B — K + 1)b bits, the
decoder must provide a register of this length for each of the 2^{b(K−1)} possible
states. One obvious remedy to this situation is to limit B to some manageable
number, say on the order of 1000 or less, by terminating the “block” with a
b(K − 1) zero bit tail every (B − K + 1)b data bits. This has two disadvantages,
however. First, it reduces the efficiency by increasing the effective required E_b/N_0,
and the required bandwidth, by a multiplicative factor of 1 + [(K − 1)/B]; also it
requires interruption of the data bit stream periodically to insert nondata tails, a
common drawback of block codes.
These drawbacks can be avoided by simple modification of the basic algor-
ithm. The simplest approach is to recognize that, other than in a catastrophic
code, a path which is unmerged from the correct path will accumulate distance
from it as an increasing function of the length of the unmerged span. Thus, upon
merging, an incorrect path with a very long unmerged span will have very low
probability of having higher metric than the correct path since this probability
decreases exponentially with distance. Consequently, with very high probability,
the best path to each of the 2^{b(K−1)} states will have diverged from the correct path
only within a reasonably short span, typically a few constraint lengths. Thus
without ever inserting tails, we may truncate the path memory to say five con-
straint lengths, using shift registers of length 5bK for each state. As each new set of
b bits enters the registers of each state, the b bits which entered 5K branch times
earlier are eliminated, but as this occurs the decoder makes a final decision on
these bits, either by choosing for each set of b bits the oldest shift register contents
of the majority of the states, or more simply, by accepting the contents of an
arbitrary state, on the grounds that, with high probability, all paths will be identical
at this point and before. The analysis of the loss in performance caused by these
truncation strategies appears quite difficult, but simulations indicate a minimal
loss when path memories are truncated more than five constraint lengths back.
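As an illustration of the simplest truncation strategy just described, the following sketch (not from the text; the generators 111 and 101 for a K = 3, rate 1/2 code, the five-constraint-length window, and the rule of releasing the oldest bit of the currently best survivor are all illustrative assumptions) shows hard-decision Viterbi decoding with a truncated path memory.

```python
# Hedged sketch (not from the text): hard-decision Viterbi decoding with a
# truncated path memory, for a K = 3, rate 1/2 code assumed here to have
# generators 111 and 101.  Once the memory fills (5K branches), the oldest
# bit on the best surviving path is released as a final decision; remaining
# bits in the window are not flushed in this sketch.

K = 3                      # constraint length
G = (0b111, 0b101)         # generator taps (assumed)
NSTATES = 1 << (K - 1)     # 2^(K-1) states
TRUNC = 5 * K              # path memory length in branches

def branch_output(state, bit):
    reg = (bit << (K - 1)) | state           # shift register contents
    return tuple(bin(reg & g).count("1") & 1 for g in G)

def viterbi_truncated(received):
    """received: list of (c1, c2) hard-decision symbol pairs."""
    metrics = [0] + [10**9] * (NSTATES - 1)   # start in state 0
    paths = [[] for _ in range(NSTATES)]      # truncated path memories
    decided = []
    for r in received:
        new_metrics = [10**9] * NSTATES
        new_paths = [None] * NSTATES
        for state in range(NSTATES):
            for bit in (0, 1):
                nxt = ((bit << (K - 1)) | state) >> 1
                dist = sum(a != b for a, b in zip(branch_output(state, bit), r))
                m = metrics[state] + dist
                if m < new_metrics[nxt]:
                    new_metrics[nxt] = m
                    new_paths[nxt] = paths[state] + [bit]
        metrics, paths = new_metrics, new_paths
        if len(paths[0]) > TRUNC:              # memory full: force a decision
            best = min(range(NSTATES), key=lambda s: metrics[s])
            decided.append(paths[best][0])     # oldest bit of best survivor
            paths = [p[1:] for p in paths]     # drop it from every register
    return decided
```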
A better, but more complex, truncation strategy is to compare all likelihood functions or metrics after each new branch, not only in groups of 2^b but also among all 2^{b(K−1)} surviving paths, to determine the most probable path leaving the given node at the truncation point; then among the 2^{b(K−1)} paths, we choose the outputs corresponding to the highest metric (several constraint lengths forward). In a somewhat superior manner, this strategy permits reduction of the memory length to 4bK or less in practical situations, as has been determined by simulation.
In addition, the loss in performance due to truncation with this decision strategy
can be analyzed on an ensemble basis, as will be shown in Sec. 5.6 (also see
Prob. 4.24).
The second inherent assumption has been that accumulated branch metrics

Σ_{i=1}^{B} ln p(y_i | x_{mi})
can be stored precisely. We note that, other than for the BSC where they simplify
to integers, the metrics will be real numbers. For example, on an AWGN channel,
they consist of linear combinations of the demodulator outputs for each symbol
[see (4.2.3)]. Even if these symbols are quantized to J levels, this does not mean
that the metrics are quantized; for as shown in Sec. 2.8 (see Fig. 2.14) the
quantized channel is characterized by a transition probability matrix {p(b_j | a_k)}
whose logarithms are real numbers.^{11} Nevertheless, it has been found, again by extensive simulation (see Heller and Jacobs [1971]), that for all values of E_b/N_0, for an eight-level quantized AWGN channel with binary (biphase or quadriphase modulated) inputs, use of the definitely suboptimal metric

Σ_i (y_i, x_{mi})
where the y_i are n-dimensional vectors of quantized demodulator outputs with
integer values from zero to seven, results in a total performance loss of only
0.25 dB relative to that with unquantized demodulator outputs and unquantized
metric (see Fig. 4.12). Another problem arises because, even though we may
quantize each branch metric to a reasonable number of bits to be stored, the
accumulated metrics will grow linearly with the number of branches decoded. This
difficulty is easily avoided by renormalizing the highest metric to zero simply by
subtracting an equal amount from each accumulated metric after each branch
"! See Prob. 4.20 for a bounding technique for a decoder which uses integer metrics.
Figure 4.16 Unnormalized metrics for direct paths from state a at node j to states a and b at node j + K − 1.
calculation. The maximum spread among all 2^{b(K−1)} state metrics is easily bounded as follows. Suppose that the greatest branch metric possible is zero and the least is the negative integer −ν, which we can guarantee by subtracting a constant from all possible branch metrics. For the binary-input, octal-output quantized AWGN channel just discussed with rate 1/n coding, ν = 7n. Then it follows easily that the maximum spread in metrics for a constraint length K, rate b/n code is (K − 1)ν, for any state can be reached from any other state in at most K − 1 transitions. At any node depth j + K − 1, consider the highest metric state a, without normalization, and any other state b. There exists a path (not necessarily the surviving path) which diverged from the path to state a at node j and arrives at state b at node j + K − 1 (see Fig. 4.16). Now since all branch metrics lie between zero and −ν, the metric change in the path to a over the last K − 1 branches is nonpositive, while the metric change in the path to state b must be between zero and −(K − 1)ν; hence the spread is no greater than (K − 1)ν. Now if this particular path did not survive, this can only be due to the fact that the surviving path to state b has higher metric than this path; hence the spread will be even less. In conclusion then, if we renormalize by adding an integer to bring the highest state metric to zero after each branch calculation, the minimum state metric is never smaller than −(K − 1)ν, so we need only provide ⌈log (K − 1)ν⌉ bits of storage for each state metric (where ⌈x⌉ denotes the least integer not less than x).
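A minimal sketch of this renormalization and of the resulting storage requirement follows (not from the text; the constraint length and quantizer parameters are illustrative):

```python
# Hedged sketch (not from the text): after each branch, an integer is added to
# every accumulated state metric so that the largest becomes zero; with branch
# metrics confined to [-v, 0], the spread never exceeds (K - 1)*v, so roughly
# ceil(log2((K - 1)*v)) bits of storage per state metric suffice (the text's
# bound).  The numbers below are illustrative.

import math

K, n = 7, 2                  # constraint length, symbols per branch (assumed)
v = 7 * n                    # worst-case branch metric magnitude (octal-quantized AWGN)

def renormalize(state_metrics):
    """Shift all metrics so the best (largest) becomes zero."""
    best = max(state_metrics)
    return [m - best for m in state_metrics]

spread = (K - 1) * v
bits = math.ceil(math.log2(spread))
print(f"metric spread <= {spread}, so about {bits} bits per state metric suffice")
```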
Finally, we consider the synchronization of a Viterbi decoder. For block
codes, it is obvious that, without knowledge of the position of the initial symbol of
each received code vector, decoding cannot be performed. Hence block coding
systems either incorporate periodic uncoded synchronization sequences which
permit the receiver initially to acquire the code synchronization, or they modify
the block code so as to cause unsynchronized code vectors to be detected as such.
In the first case, the effective data rate is reduced by insertion of the uncoded
synchronization sequence, while in the second a relatively complex synchroniza-
tion system must be provided in the decoder. These difficulties are greatly reduced
in convolutional decoders. In an unterminated convolutional code, it would
appear that we require both branch and symbol synchronization. In a binary-
input, rate b/n code, symbol synchronization refers to knowledge of which of n
successive received symbols initiates a branch; let us assume initially that this has
already been acquired. On the other hand, branch synchronization refers to know-
ledge of which branch in the code path is presently being received. But if symbol
synchronization is known, branch synchronization is not required. For suppose
that, rather than initiating the decoding operation at the initial node (all-zeros in
the encoder) as we have always assumed, we were to begin in the middle of the
trellis. The Viterbi algorithm is identical at each node; the only problem would be
how to choose the initial values of the 2^{b(K−1)} state metrics. In normal decoding
when correct decisions are being made, one metric, generally corresponding to the
correct path or at least to a path which diverged from it only a few branches back,
will be largest; but, when errors are occurring, paths unmerged from the correct
path will have the highest metric so that conceivably the correct path might even
have the lowest metric at a given node. Yet we have seen that with probability one
for all but catastrophic codes, error sequences are of finite length so that, even
from the worst condition when the correct path has lowest metric at a given node,
the decoder will eventually recover and resume making correct decisions. Thus it
is clear that if we start decoding at an arbitrary node with all state metrics set to
zero, the decoder performance may initially be poor for several branches but, after
at most a few constraint lengths, it will recover in the sense that the correct path
metric will begin to dominate, and the data will be decoded correctly in much the
same way as the decoder recovers after a span of decoding errors. Analysis of this
effect on an ensemble basis is very similar to that of path memory truncation, and
is treated in Sec. 5.6.
Thus the only real synchronization problem in Viterbi decoding of convolu-
tional codes is that of symbol synchronization. Here, fortuitously, we have the
reverse of the above situation. In a rate b/n code, if the wrong initial symbol
out of n possible consecutive symbols is initially assumed, the correct path branch
metrics will appear much like those of other paths; thus all path metrics will tend
to remain relatively close together with no path emerging with more rapidly
growing metric than all the others (see for example Prob. 4.19); this condition can
be easily detected and the initial assumption of the initial symbol changed to
another of the n possibilities. Thus at most n positions must be searched; even
with enough time spent at each position to be able to exclude the incorrect
hypotheses with low probability of error, symbol synchronization can be achieved
within a few hundred bits when n is on the order of four or less.
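One possible way to operationalize this search is sketched below (not from the text; the decoder hook, window, and threshold are hypothetical placeholders, since the text only states that the "bunched metrics" condition is easily detected):

```python
# Hedged sketch (not from the text): symbol-synchronization search for a rate
# b/n code.  Each of the n candidate symbol offsets is tried in turn; an
# offset is rejected when, after an observation window, no survivor metric has
# pulled away from the others (all renormalized metrics stay bunched near
# zero).  The decoder hook and the threshold are illustrative placeholders.

def metrics_bunched(renormalized_metrics, threshold):
    """renormalized_metrics: state metrics after the best has been shifted to 0
    (so all values are <= 0).  Returns True if even the worst survivor is
    within `threshold` of the best, i.e., no path is dominating."""
    return min(renormalized_metrics) > -threshold

def find_symbol_sync(decode_window, n, threshold=30):
    """decode_window(offset) is assumed to run the Viterbi decoder for a few
    hundred branches starting at symbol `offset` and return the final
    renormalized state metrics."""
    for offset in range(n):
        final_metrics = decode_window(offset)
        if not metrics_bunched(final_metrics, threshold):
            return offset          # one path has emerged: accept this offset
    return None                    # none accepted; try again with more data
```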
4.8 FEEDBACK DECODING*
We have thus far considered only maximum likelihood decoding of convolutional
codes, developing and analyzing the Viterbi algorithm which results naturally
from the structure of the code. Its major drawback is, of course, that while error
probability decreases exponentially with constraint length, the number of code
states, and consequently decoder complexity, grows correspondingly. Partially to
avoid this difficulty, a number of other decoding algorithms have been proposed,
analyzed, and employed in convolutionally coded systems. Sequential decoding
achieves asymptotically the same error probability as maximum likelihood decod-
ing, but without searching all possible states; in fact, the number of states searched
is essentially independent of constraint length, thus rendering possible the use of
very large K and resulting in very low error probabilities. This very optimistic
picture is clouded by the fact that the number of state metrics actually searched is
a random variable with unbounded higher-order moments; this poses some rather
subtle difficulties which degrade performance. To do justice to the complex sub-
ject of sequential decoding, we must first explore more of the ensemble properties
of convolutional codes. This is done in Chap. 5, and then Chap. 6 is devoted to
sequential decoding.
Another class of decoding algorithms, known collectively as feedback decod-
ing, has received much attention for its simplicity and applicability to interleaved
data transmission. The principles of operation of a feedback decoder, or more
precisely a syndrome-feedback decoder, are best understood in terms of a specific
example; Fig. 4.17 shows the ultimately simplest rate r = 1/2 encoder, and its trellis and tree diagrams. Clearly, the free distance is 3, so that, if on a BSC only one
error occurs in a sequence of two branches (the constraint length), the error will be
corrected by a maximum likelihood (minimum distance) decoder. Now suppose
that instead of a true maximum likelihood decoder as described and analyzed
earlier, we use instead a truncated-memory decoder which makes a maximum
likelihood decision on a given bit or branch based only on a finite number of
received branches beyond this point. For the example at hand, suppose the deci-
sion for the first bit were based on only the two branches shown on the tree
diagram in dotted box A. If the metric at nodes a or b is greatest, we decide that
the first bit was a "0"; while if the metric at nodes c or d is greatest, we decide in
favor of a “1.” Specifically, if the sequence received over a BSC is y = 100110, the
metric (negative Hamming distance) is −1 at node c and less at all other nodes at
the third branching level. This decoder will then decide irrevocably that the first
transmitted bit was a “1.” From this point, only paths in the lower half of the tree
are considered. Thus the next decision is among the paths in dotted box B and is
based on the metric at the four nodes e, f, g, and h. For the given y, the metric,
based on branches 2 and 3, at nodes e and f is −1 and at all other nodes it is less.
Hence the second bit decision is a “0.” Note that the effect on the metric of the
* May be omitted without loss of continuity.
[Figure: (a) encoder; (b) trellis diagram; (c) tree diagram, with dotted boxes A and B marking the two-branch decision spans and the received vector 10 01 10 shown beneath the tree.]
Figure 4.17 Code example for feedback decoding.
first branch could be removed because all paths in B have the same first branch,
this having been irrevocably decided in the previous step. The decoder can then
proceed in this manner, operating essentially as a “sliding block decoder” on
codes of four codewords of 2 branches each. This decoder is also called a feedback
decoder because the decisions are “fed back” to the decoder in determining the
subset of code paths which are to be considered next. On the BSC, in general, it
performs nearly as well^{12} as the Viterbi decoder in that it corrects all the more probable error patterns, namely all those of weight (d_f − 1)/2 or less, where d_f is the free distance of the code. Hence, some (though not necessarily all) of the minimum weight error patterns which cause a decision error with this decoder will also cause decision errors with the Viterbi decoder.^{13}
The above example is misleadingly simple in that the memory length, which
we henceforth denote by L, needs only to be equal to the constraint length to
guarantee that the minimum distance between the correct path and any path
which diverges from it will be at least d_f within L branches. Examining the trellis of the rate r = 1/2, K = 3 code of Fig. 4.8, for which d_f = 5, we find that it takes L = 6 branches for all paths (unmerged as well as merged) which diverged from the 0 path at the first node to accumulate a weight equal to d_f or greater. In fact, the worst culprit is the path whose data sequence 101010 takes exactly six branches to reach weight d_f = 5. This code then guarantees the correction of all two-error patterns in any sequence of 12 code symbols (six branches). To correct three-error patterns, we must have a code for which d_f is at least 7, which with rate r = 1/2 requires K ≥ 5. The best K = 5, r = 1/2 code requires L = 12 for all unmerged incorrect paths to reach weight 7. It turns out, however, that there is a K = 10, r = 1/2 systematic code for which all paths which diverge from 0 reach weight 7 by L = 11; hence a feedback decoder for this code with L = 11 corrects all three-
error patterns in any sequence of 22 symbols. We shall return to the question of
systematic codes momentarily.
The fact that this decoder can be regarded as a sliding block decoder can be
exploited to simplify its implementation for a BSC. We recall from Sec. 2.10 that a
maximum likelihood, or minimum distance, decoder for a systematic block code
on a BSC can be efficiently implemented by calculating the syndrome of the
received vector, and from this obtaining the most likely error vector by consulting
a table which contains the most likely error vector for each syndrome. The simple
example under consideration is a systematic code since one of its generators
contains only one tap, and hence the information symbols are transmitted
^{12} For the special case of the code of Fig. 4.17, Morrissey [1970] has shown that on the BSC the
feedback decoder coincides with the Viterbi decoder. However, this is the only case for which this is
known to hold.
^{13} Note that the Viterbi decoder must also truncate its memory for practical reasons, as discussed in
Sec. 4.7, but that if this is done after about five constraint lengths negligible degradation results. In the
present discussion, memory is truncated much earlier so that the decoders are generally much more
suboptimal, although not in the special case of Fig. 4.17.
unmodified through this tap. The transmitted code vector can be written for convenience as

x = (u_1, u_2, u_3, ...; p_1, p_2, p_3, ...)

where u_j is the jth information symbol (upper generator in this case) and p_j is the jth parity (lower generator) symbol. This is generated from the data source by the operation

x = uG

where

      | 1            1 1          |
G =   |   1            1 1        |
      |     1            1 1      |
      |       ·              · ·  |
Note that for convenience here we have departed from the convention established
in Sec. 4.1 of writing consecutively^{14} all the generator outputs for the jth input,
choosing rather to partition the vector into two (generally n) subsequences, one
for each generator. This then requires that the generator matrix also be parti-
tioned into submatrices, one for each subgenerator sequence. It follows from
Sec. 2.10 then that the transpose of the parity-check matrix for this code is
        | 1 1        |
        |   1 1      |
        |     1 1    |
        |       · ·  |
H^T =   | 1          |
        |   1        |
        |     1      |
        |       ·    |
But since each submatrix of the parity check matrix also has the property that
each row is shifted from the preceding row by one term, it is clear that the
^{14} This is merely a convention and does not alter the fact that p_j is transmitted immediately after u_j and thus that the u and p subsequences are interleaved together on transmission.
[Figure: received information and parity streams entering a syndrome generator (a replica encoder whose output is added to the received parity symbols); syndrome storage, decision logic (an and-gate in this example), and error feedback; and an error corrector adding the decision output to the delayed information symbols.]
Figure 4.18 Feedback decoder for code of Fig. 4.17.
syndrome
s = yH^T
can be generated by passing the received noise-corrupted information symbols to
a single generator sequence convolutional encoder and adding its output to the
received noise-corrupted parity symbols. In Fig. 4.18 we show the information
and parity symbols as if they were on two separate channels. In fact, a commuta-
tor at the encoder output provides for consecutive transmission of u; and p;, while
a decommutator before the encoder separates the error-corrupted information
and parity subsequences. However, since the channel is memoryless, we may treat
the interleaved information and parity subsequences as if they were transmitted
on separate BSCs with the same error statistics.
Clearly, in the absence of errors, the syndrome is always zero since it is then
the sum of two identical sequences. In the presence of errors, “1”s will appear.
Returning to the "sliding block" decoding viewpoint, we see that if we preserve L symbols of the syndrome, then this represents the syndrome for a specialized block code corresponding to a segment of the tree over L branches. For the 2^L
syndromes of this block code, we could provide a table-look-up corresponding to
the most likely (lowest weight) error pattern for each syndrome. But, in fact, the
decision at each step of a feedback decoder is only on which half of the tree is more
likely. Equivalently, the decision may be on whether the received information
symbol corresponding to the first branch was correctly received or had an error in
it. This then requires only that the syndrome table store a “0” or “1,” the latter
corresponding to an error, which can then be added modulo-2 to the correspond-
Table 4.2 Syndrome look-up table for code of Fig. 4.17

Syndrome        Most likely error pattern         Table output
S_1   S_2       u_1   p_1   u_2   p_2             Error in u_1
 0     0         0     0     0     0                   0
 0     1         0     0     1     0                   0
 1     0         0     1     0     0                   0
 1     1         1     0     0     0                   1
ing information symbol, which itself also had to be stored in an L-stage shift
register.
For the code under consideration the syndromes, most likely error sequences,
and required output are shown in Table 4.2. Thus in general, the look-up table
can be implemented by a read-only memory logic element with L inputs whose
single output is used to correct the information bit. In this simple case, as shown
in Fig. 4.18, the general logic element may be replaced by an and-gate. There
remains, however, one more function to be implemented in this feedback decoder.
This is to feed back the decision just made; if the decision was that an error
occurred in u_1, this is most easily implemented by adjusting in the decision device for the effect of that error. But the decision device here consists of just the stored syndrome for the past L branches and a time-invariant table. The error in u_1 here produced "1"s in both syndrome stages, but, for the next decision (on u_2) the contents of the rightmost stage is lost and replaced by the contents of the previous stage. Thus to eliminate the effect of the u_1 error, we need only add modulo-2 the
decision output to the rightmost stage and store this until the arrival of the next
syndrome bit.
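A compact sketch of this syndrome-feedback decoder follows (not from the text; it assumes the Fig. 4.17 code has systematic parity p_j = u_j ⊕ u_{j−1}, consistent with Table 4.2, and the variable names are illustrative):

```python
# Hedged sketch (not from the text) of the feedback (syndrome) decoder of
# Fig. 4.18 / Table 4.2 for the systematic K = 2, rate 1/2 code of Fig. 4.17,
# whose parity is assumed to be p_j = u_j XOR u_{j-1}.  The and-gate rule of
# Table 4.2 declares an error in u_{j-1} exactly when two consecutive syndrome
# bits are both 1; the decision is then fed back into the remaining syndrome.

def feedback_decode(received):
    """received: list of (info, parity) hard-decision pairs from the BSC.
    Returns the corrected information bits (the final bit is left undecided in
    this sketch, since its second syndrome bit never arrives)."""
    decoded = []
    prev_info = 0          # u_{j-1}, assuming the encoder started cleared
    prev_syndrome = 0      # syndrome bit one branch back
    prev_received_info = None
    for info, parity in received:
        syndrome = info ^ prev_info ^ parity       # re-encode and compare
        if prev_received_info is not None:
            error = prev_syndrome & syndrome        # Table 4.2: "1" only for (1,1)
            decoded.append(prev_received_info ^ error)
            syndrome ^= error                       # feed the decision back
        prev_received_info = info
        prev_info = info
        prev_syndrome = syndrome
    return decoded
```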
As long as no decision errors occur, the decoder continues to operate in this
manner, in this case correcting all single errors in any sequence of four symbols.
However, if two out of any four consecutive symbols are ever in error, a decision
error occurs that may propagate well beyond a single decision error since the
error is in fact fed back to affect further decisions. This is called the error propaga-
tion effect, which is common to some extent to all convolutional decoders. That
the error propagation in this case is finite, however, is evident from the fact that, if
no channel errors occur in two consecutive branches, both syndrome stages
cannot simultaneously contain “1”s; hence, as seen in Table 4.2, no decision error
is detected or fed back, thus returning the syndrome register to the all-zeros state
and passing the correct information symbols unmolested. The above example for rate r = 1/2 single-error-correcting codes can be generalized to any rate (n − 1)/n single-error-correcting code (see Prob. 4.13).
Generalization to any systematic convolutional code of any rate b/n is
straightforward. For such a code, the information sequence is subdivided into
subsequences of length b bits. Each symbol of each length b information sub-
sequence is transmitted and simultaneously inserted into the first stage of b en-
coder registers, each of which is of length K. The contents of these registers are
linearly (modulo-2) combined to generate (n — b) parity symbols after each inser-
tion of the b information symbols. The syndrome generator in the decoder then
operates in almost the same way as the encoder. That is, the b error-corrupted
information symbols are again encoded as above into (n — b) symbols, which are
added to the corresponding error-corrupted parity symbols to form syndrome
symbols. Thus for each subsequence of b information symbols, a subsequence of
(n — b) syndrome symbols is generated, which, in the absence of errors, would
be all zeros. If the truncated maximum likelihood decision is to be based on L
branches back, the (n — b)L syndrome bits must be stored and, in general, the
syndrome table-look-up must consist of 2^{(n−b)L} entries of b bits each. A "1" in
any of the b table-look-up outputs indicates an error to be corrected in an
information bit and corresponding corrections to be made in the syndrome.
To illustrate a somewhat more powerful code, as well as a further
simplification in the decoder which is possible for a limited subclass of convolu-
tional codes, let us consider a two-error-correcting, r = 1/2 code. As discussed earlier, the nonsystematic K = 3, r = 1/2 code of Fig. 4.2a requires a memory of
six branches to ensure that all incorrect unmerged paths reach the free distance 5
from the correct path. But the complexity of the feedback decoder depends almost
exclusively on the complexity of the syndrome logic and hardly at all on con-
straint length. Furthermore, for correcting information errors, systematic codes
are more natural to work with than nonsystematic ones.^{15} In general, with true
maximum likelihood decoding whose complexity depends directly and almost
exclusively on constraint length, nonsystematic convolutional codes are superior
as noted in Sec. 4.5 since they achieve higher free distance for a given K. On the
other hand, since a feedback decoder is really a sliding block decoder of block
length L branches and, as shown in Sec. 2.10, for every nonsystematic block code
there is a systematic block code with equal performance, there appears to be every
advantage to using systematic convolutional codes with feedback decoders. The
systematic code which may be feedback decoded with a syndrome memory of six
branches is shown in Fig. 4.19. It may be verified from the corresponding tree
diagram that d_f = 5 and that all paths which diverge from the all-zeros at a given
^{15} Actually a syndrome can be calculated almost as easily for a nonsystematic code, but errors then
must be found in all the received symbols which must then be combined to generate the information
symbol.
Figure 4.19 Systematic encoder capable of two-error correction (K = 6, r = 1/2).
node are at least at distance 5 from it within six branches. Thus a feedback
decoder with a memory of six branches can correct any two symbol errors in any
sequence of 12 symbols of this code.
While the general structure of this feedback decoder has already been
described, we now demonstrate a considerable simplification on the general syn-
drome lookup-table procedure that is possible for a limited class of codes of
which this is a member. Suppose we denote possible errors in the information and parity symbols on the jth branch by e_j^u and e_j^p, respectively, each of which equals
zero if the corresponding symbol is received correctly and equals one if it is in
error. Then, since the syndrome symbols are all zero when no errors are made
(assuming no previous errors), the first six syndrome symbols, corresponding to as
many branches, are (see Fig. 4.20)
S_1 = e_1^u ⊕ e_1^p
S_2 = e_2^u ⊕ e_2^p
S_3 = e_3^u ⊕ e_3^p                                                        (4.8.1)
S_4 = e_4^u ⊕ e_1^u ⊕ e_4^p
S_5 = e_5^u ⊕ e_2^u ⊕ e_1^u ⊕ e_5^p
S_6 = e_6^u ⊕ e_3^u ⊕ e_2^u ⊕ e_1^u ⊕ e_6^p

Now suppose we consider the set of equations for S_1, S_4, S_6 and the modulo-2 sum S_2 ⊕ S_5,

S_1 = e_1^u ⊕ e_1^p
S_4 = e_4^u ⊕ e_1^u ⊕ e_4^p                                                (4.8.2)
S_2 ⊕ S_5 = e_5^u ⊕ e_1^u ⊕ e_2^p ⊕ e_5^p
S_6 = e_6^u ⊕ e_3^u ⊕ e_2^u ⊕ e_1^u ⊕ e_6^p
These equations have two important properties: (a) each equation contains e_1^u, the error in the first information symbol, and (b) no other symbol error occurs in more than one equation. Such a set of equations is said to be orthogonal on e_1^u. Hence if u_1 is in error so that e_1^u = 1 and no other symbol errors occur in the first six branches, all the syndrome linear combinations of (4.8.2) will equal 1. If any other error occurs alone among the first 12 symbols [or, more precisely, among the 11 of those 12 symbols whose error terms appear in (4.8.2)] only one sum in (4.8.2) will be 1 and the other three will be 0. If e_1^u = 1 and any other error occurs among the first 12 symbols, three of the sums are 1 and the other is a 0. Finally, if e_1^u = 0 but two other symbols are in error, at most two sums of (4.8.2) will be 1, since each such error term occurs in one equation (possibly both in the same equation). Thus we conclude that e_1^u = 1 if and only if three or four of the sums of (4.8.2) equal 1, and e_1^u = 0 if less than three of the sums equal 1. This suggests the alternate
mechanization of the syndrome table-look-up shown in Fig. 4.20. Aside from the
simplification of the general logic element, the remainder of the syndrome-feedback
[Figure: syndrome generator, threshold device (output = 1 if 3 or 4 inputs = 1), error feedback, and error corrector.]
Figure 4.20 Threshold decoder for encoder of Fig. 4.19.
decoder is as previously described. This special form of a feedback decoder is
called a threshold decoder because of the threshold logic (also called majority logic)
involved in the error decisions. The class of threshold-decodable convolutional
codes was first defined and extensively developed by Massey [1963].
Clearly, what has just been described for the first branch applies to all further
branches, with a correction to the information symbol u_j performed by adding e_j^u to it. Also wherever e_j^u = 1, the effect of the error is canceled by adding, in the
appropriate positions of the syndrome register, the parity-check symbols gen-
erated by a 1 in the erroneous information symbol. In this way it is clear that any
single error or pair of errors in six consecutive branches are corrected. It should be
recognized, however, that, since there are 64 possible syndromes and only 1 + \binom{6}{1} + \binom{6}{2} = 22 error patterns of weight less than or equal to 2, this decoding
procedure does not necessarily replace exactly the table-look-up corresponding to
the truncated maximum likelihood decoder. That is, while it does guarantee that
all error patterns of weight up to (d_f − 1)/2 are corrected, there may be some
three-error patterns corrected by the table-look-up procedure which are not cor-
rected here. Obviously, not all systematic convolutional codes may be decoded by
a threshold decoder. It should be clear from the above example that a code is e-error-correctable by a threshold decoder if and only if 2e linear combinations of syndrome symbols can be formed which are orthogonal on one information symbol error. Then the threshold logic declares an error whenever more than e of the linear sums equal 1. It is also clear that our original example of Fig. 4.18 was a threshold decoder for a rate r = 1/2 code with e = 1.
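A sketch of the majority-logic decision itself is given below (not from the text; the parity-generator taps at delays 0, 3, 4, and 5 are an assumption inferred from the syndrome equations (4.8.1) as reconstructed above):

```python
# Hedged sketch (not from the text): the threshold (majority-logic) decision of
# Fig. 4.20 on the first information bit, using the four orthogonal check sums
# of (4.8.2).  The systematic parity generator of the K = 6 encoder is assumed
# to have taps at delays 0, 3, 4, and 5, consistently with (4.8.1) above.

PARITY_TAPS = (0, 3, 4, 5)      # assumed taps of the K = 6 systematic encoder

def syndromes(received):
    """received: six (info, parity) hard-decision pairs.  Returns S_1..S_6."""
    s = []
    for j in range(6):
        parity_estimate = 0
        for t in PARITY_TAPS:
            if j - t >= 0:
                parity_estimate ^= received[j - t][0]    # re-encode received info
        s.append(parity_estimate ^ received[j][1])
    return s

def first_bit_error(received):
    """Majority-logic decision on e_1^u: declared 1 iff at least three of the
    four orthogonal sums S_1, S_4, S_2 + S_5, S_6 equal 1."""
    S = syndromes(received)
    checks = [S[0], S[3], S[1] ^ S[4], S[5]]
    return int(sum(checks) >= 3)
```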
Another difficulty with threshold decoding is that for more than three-error
correction, the required syndrome memory length grows very rapidly. As noted previously, there exists a systematic rate r = 1/2 convolutional code (Bussgang [1965]) for which all incorrect paths are at distance 7 from the correct path within L = 11 branches, thus affording the possibility of correcting any error pattern of up to three errors in any sequence of 22 symbols. However, it is not orthogonalizable and hence not threshold-decodable. The existence of read-only memories containing 2^{11} bits in a single integrated circuit makes it possible to implement the entire table-look-up function (with an 11-bit input and a single output) quite easily. The shortest L for an r = 1/2, three-error-correcting orthogonalizable convolutional code is 12, affording correction of up to three errors in sequences of length 24 symbols. The situation is much less favorable, however, for an r = 1/2 four-error-correcting code, which requires that all incorrect paths be at distance at least 9 from the correct path. Computer search (Bussgang [1965]) has shown that the minimum value of L is 16. This code is not orthogonalizable and the table-look-up would require a memory of 2^{16} bits. On the other hand, the shortest orthogonalizable, and hence threshold-decodable, four-error-correcting r = 1/2 code requires L = 22. L grows rapidly beyond this point, with L = 36 for a threshold-decodable five-error-correcting convolutional code (Lucky, Salz, and Weldon [1968]).
While the feedback decoding, or sliding block decoding, concept can apply to
channels other than the BSC, its appeal decreases considerably when the binary
operations must be replaced by operations involving more elaborate metrics.
Then also, as we have noted, even for the BSC, complexity hardly justifies the
procedure for the correction of more than a few errors. On the other hand, one
rather important feature of the procedure is that it can be very simply adapted to
interleaved operation on channels with memory where bursts of errors are preva-
lent. While a general interleaver, applicable to any channel, was described in
Sec. 2.12, this was external to the encoder and decoder, constituting an interface
between the latter and the channel. A somewhat more direct approach to inter-
leaving with convolutional codes is to replace all the single-unit delay elements in the encoder shift register by I-unit delay elements,^{16} where I is the degree of inter-
^{16} Integrated circuits providing thousands of bits of delay-line storage are common.
leaving. As a result, the encoder becomes effectively I serial encoders, each operating on one of the subsequences of information symbols u_i, u_{i+I}, u_{i+2I}, ..., where i is an integer between 1 and I. This technique is called internal interleaving. Its main drawback is that the storage required in the decoder is multiplied by I. Thus, for Viterbi decoding, the 2^{b(K−1)} state metrics and path memories must be stored for each of the I interleaved codes, making the resulting storage requirements prohibitive when I is in the hundreds or thousands. A feedback decoder,
on the other hand, consists merely of the replica of the encoder, to generate the
syndrome, and a syndrome memory shift register. Thus interleaved decoding, corresponding to I serial decoders, can be implemented just as in the encoder by replacing all single-unit delays by I-unit delays both in the syndrome generator and the syndrome memory register, leaving the syndrome table-look-up or threshold logic unchanged. Thus, for degree-I interleaving, a constraint length K, rate b/n, syndrome-memory L code requires (K − 1)b I-unit delay elements in the encoder and [(K − 1)b + (L − 1)(n − b)] I-unit delay elements in the decoder.
Such decoders represent a simple approach to effective decoding of bursty channels
such as are common in HF ionospheric propagation. Several techniques for
embellishing this basic concept by varying the delays between coder and decoder
stages have been proposed (Gallager [1968], Kohlenberg and Forney [1968],
Peterson and Weldon [1972]) with moderate degrees of improvement.
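The delay-multiplication idea can be sketched as follows (not from the text; the systematic rate 1/2 form and tap set are illustrative):

```python
# Hedged sketch (not from the text) of internal interleaving: multiplying every
# delay in a systematic rate 1/2 encoder by the interleaving degree I.  The
# parity taps are illustrative; the point is that the same hardware then acts
# as I independent encoders operating on interleaved subsequences.

def interleaved_parity(info_bits, taps, degree):
    """Parity stream of a systematic encoder whose unit delays are replaced by
    `degree`-unit delays: p_j = XOR of u_{j - t*degree} over the tap set."""
    parity = []
    for j in range(len(info_bits)):
        p = 0
        for t in taps:
            idx = j - t * degree
            if idx >= 0:
                p ^= info_bits[idx]
        parity.append(p)
    return parity

# With degree = 1 this is the ordinary encoder; with degree = I each of the I
# interleaved subsequences u_i, u_{i+I}, u_{i+2I}, ... is encoded independently.
```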
4.9 INTERSYMBOL INTERFERENCE CHANNELS*
The Viterbi algorithm, originally developed for decoding convolutional codes, has
also led somewhat surprisingly to a fundamental result in the optimum demodula-
tion of channels exhibiting intersymbol interference. This phenomenon, first dis-
cussed in Sec. 2.6, arises whenever a digital signal is passed through a linear
channel (or filter) whose transfer function is other than constant over the band-
width of the signal. The narrower the channel bandwidth, the more severe the
intersymbol interference. A general model of a band-limited channel is shown in
Fig. 4.21. We first treat only the uncoded case of digital (pulse amplitude or
biphase) modulation over an AWGN channel with intersymbol interference. In
the next section and in Sec. 5.8, we shall extend our results to the coded case.
The digital signal is characterized by a sequence of impulses,^{17} or Dirac delta functions

u(t) = Σ_{k=−N}^{N−1} u_k δ(t − kT)          (4.9.1)
* Intersymbol interference is treated in Secs. 4.9, 4.10, and 5.8 only. These sections may be omitted
without loss of continuity.
^{17} If the signal is instead a pulse train, i.e., a sequence of pulses each of duration T and amplitudes
{u_k}, the channel output can still be represented by (4.9.2) with h(t) being the channel impulse response
convolved with a single unit amplitude pulse of duration T.
[Figure: (a) analog model, impulse modulator and modulator/channel filter h(t), additive noise n(t), and matched filter demodulator; (b) digital equivalent model, equivalent digital filter with correlated Gaussian noise n_k.]
Figure 4.21 Intersymbol interference (band-limited) channel and matched filter demodulator.
where {u_k} is a sequence from a finite alphabet, usually binary (u_k = ±1) in what
follows. The total transmission sequence is taken to be of arbitrary length 2N.
Then the first part of the channel, which is characterized by a linear impulse
response h(t), has output

x(t) = Σ_{k=−N}^{N−1} u_k h(t − kT)    −∞ < t < ∞          (4.9.2)
The additive noise, n(t), is taken as usual to be white Gaussian noise with spectral
density N_0/2, and the received signal is denoted, as in Chap. 2, by
y(t) = x(t) + n(t) (4.9.3)
The decision rule which minimizes the error probability, based on the entire
received signal, is as derived in Sec. 2.2:
choose x_m(t), m = 1, 2, ..., M, if

ln [p(y | x_m)/p(y | x_{m′})] > 0    for all m′ ≠ m

where y and x_m are the coefficients of the Gram-Schmidt orthonormal representation of the functions y(t) and x_m(t). The number, M, of possible sequences {x_m} is bounded by M ≤ Q^{2N} where Q is the size of the alphabet of the components x_{mk}.
As shown in Sec. 2.2, assuming a priori equiprobable sequences, the log likelihood
ratio for m and m′ is given by

ln [p(y | x_m)/p(y | x_{m′})] = (‖y − x_{m′}‖² − ‖y − x_m‖²)/N_0
    = (2/N_0) ∫_{−∞}^{∞} y(t)[x_m(t) − x_{m′}(t)] dt − (1/N_0) ∫_{−∞}^{∞} [x_m²(t) − x_{m′}²(t)] dt          (4.9.4)
For this case, the integral representation of the log likelihood function is more
useful. The infinite limits of the integrals are a consequence of the fact that h(t) is
defined over at least the semi-infinite line.
Returning to the representation (4.9.2) of x(t), and letting
x_m(t) = Σ_{k=−N}^{N−1} u_{mk} h(t − kT)    m = 1, 2, ..., M

it follows that the maximum likelihood decision rule can be based on

Λ_m = (1/N_0) [ 2 Σ_{k=−N}^{N−1} u_{mk} y_k − Σ_{k=−N}^{N−1} Σ_{j=−N}^{N−1} u_{mk} u_{mj} h_{k−j} ]          (4.9.5)

where

y_k = ∫_{−∞}^{∞} y(t) h(t − kT) dt          (4.9.6)

and

h_{k−j} = ∫_{−∞}^{∞} h(t − kT) h(t − jT) dt = (1/2π) ∫_{−∞}^{∞} |H(ω)|² e^{iω(k−j)T} dω = h_{j−k}          (4.9.7)

where H(ω) is the channel transfer function, which is the Fourier transform of its impulse response. The variables y_k are the observables on which all decisions will
be based. Note that it follows from (4.9.6) that these are formed by sampling, at
intervals of T seconds, the received waveform y(t) convolved with the function
h(—t). But this is just the output of the filter matched to the channel impulse
response h(t) when its input is y(t). Thus, the observables are just the outputs of a
matched filter. The result is similar and reminiscent of that first derived in Secs. 2.1
and 2.2 for maximum likelihood decisions in AWGN, except that here the infinite
duration channel impulse response replaces the finite duration signal.
The constants, {h_i}, which depend only on the channel impulse response, are called intersymbol interference coefficients. Although, according to (4.9.7), h_i = h_{−i} is potentially nonzero for all i, in practice, for sufficiently large i, h_i ≈ 0. We shall accept this approximation to limit the dimensionality of the problem. Thus we take

h_i = 0    for i ≥ ℒ, where ℒ ≪ N          (4.9.8)
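A numerical sketch of how the coefficients h_i of (4.9.7) might be computed and truncated per (4.9.8) is given below (not from the text; the exponential impulse response and tolerance are illustrative assumptions):

```python
# Hedged sketch (not from the text): numerically evaluating the intersymbol
# interference coefficients h_i of (4.9.7) for an illustrative channel impulse
# response, and truncating them once they become negligible, as in (4.9.8).

import numpy as np

T = 1.0                          # symbol interval (illustrative)
dt = T / 1000.0
t = np.arange(0.0, 10 * T, dt)
h = np.exp(-t / T)               # example impulse response (assumption)

def isi_coefficient(i):
    """h_i = integral of h(t) h(t - iT) dt, evaluated on the sampled grid."""
    shifted = np.interp(t - i * T, t, h, left=0.0, right=0.0)
    return float(np.sum(h * shifted) * dt)

coeffs = []
for i in range(len(t)):
    hi = isi_coefficient(i)
    if abs(hi) < 1e-3 * isi_coefficient(0):
        break                    # effective memory reached: h_i ~ 0 for i >= L
    coeffs.append(hi)
print("memory L =", len(coeffs), "coefficients:", coeffs)
```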
Also, by virtue of the symmetry of the coefficients h_i, the symmetrical quadratic form in (4.9.5) can be written as the sum of the diagonal terms plus twice the upper triangular quadratic form. Thus

(1/N_0) Σ_{k=−N}^{N−1} Σ_{j=−N}^{N−1} u_{mk} u_{mj} h_{k−j} = (1/N_0) Σ_{k=−N}^{N−1} u²_{mk} h_0 + (2/N_0) Σ_{k=−N}^{N−1} u_{mk} Σ_{j=−N}^{k−1} u_{mj} h_{k−j}
    = (1/N_0) Σ_{k=−N}^{N−1} ( u²_{mk} h_0 + 2 u_{mk} Σ_{i=1}^{k+N} u_{m,k−i} h_i )

Substituting this in (4.9.5) and using (4.9.8), we have

Λ_m = (1/N_0) Σ_{k=−N}^{N−1} u_{mk} ( 2y_k − u_{mk} h_0 − 2 Σ_{i=1}^{ℒ−1} u_{m,k−i} h_i )
    = Σ_{k=−N}^{N−1} λ_k(y_k; u_{mk}, u_{m,k−1}, ..., u_{m,k−(ℒ−1)})          (4.9.9)

where we let u_{mj} = 0 for j < −N, and define

N_0 λ_k(y_k; u_{mk}, u_{m,k−1}, ..., u_{m,k−(ℒ−1)}) = u_{mk} ( 2y_k − u_{mk} h_0 − 2 Σ_{i=1}^{ℒ−1} u_{m,k−i} h_i )          (4.9.10)

This expression is reminiscent of the branch metric which establishes the decoding criterion for convolutional codes on a binary-input AWGN channel. The latter, as given by (4.2.3) restated in the present notation, is

N_0 λ_k(y_k; u_{mk}, u_{m,k−1}, ..., u_{m,k−(K−1)}) = 2(x_{mk}, y_k)          (4.9.11)

where

y_k = (y_{k1}, y_{k2}, ..., y_{kn})
and
x_{mk} = (x_{mk1}, x_{mk2}, ..., x_{mkn}) = x_{mk}(u_{mk}, u_{m,k−1}, ..., u_{m,k−(K−1)})

are respectively the n-symbol received vector for the kth branch and the n-symbol code vector for the kth branch of the mth code path, whose inner product properly scaled constitutes the kth branch metric. The last expression for x_{mk} follows from the observation that the code vector for the kth branch of a constraint length K convolutional code depends on the present data symbol and the preceding K − 1.
Equation (4.9.10) differs from (4.9.11), the branch metric expression for convo-
lutional codes, in two respects: the "branch" observable is a scalar rather than a vector, and the expression is quadratic in the u_{mk}'s rather than linear in the x_{mkj}'s, which are algebraic (finite field) functions of the u_{mk}'s. However, in the most
important characteristic, namely the dependence on finitely many past data
inputs, the expressions are fundamentally the same. This then leads us to the
important conclusion that the maximum likelihood demodulation of binary data,
transmitted over an AWGN channel with intersymbol interference of finite memory ℒ, can be based on a 2^{ℒ−1} state trellis^{18} where the states are determined by the preceding ℒ − 1 data symbols. In other words, to maximize Λ_m, it suffices to maximize over all paths through the 2^{ℒ−1} state trellis whose branch metrics are
given by (4.9.10). This, of course, is achieved by the Viterbi algorithm (VA)
developed in Sec. 4.2, which we restate as follows. Given the 2^{ℒ−1} best paths through branch k − 1, denoted by

û_1 û_2 ⋯ û_{k−ℒ} u_{k−(ℒ−1)} ⋯ u_{k−1}    where u_{k−(ℒ−1)} ⋯ u_{k−1}

denotes one of the 2^{ℒ−1} binary state vectors and û_1 ⋯ û_{k−ℒ} are the best path "memories" for that state, and given the corresponding path metrics to that point, M_{k−1}(u_{k−1} ⋯ u_{k−(ℒ−1)}), the best paths to each state through branch k are determined by the pairwise maximization

M_k(u_k, u_{k−1}, ..., u_{k−(ℒ−2)})
   = max [ M_{k−1}(u_{k−1}, u_{k−2}, ..., u_{k−(ℒ−2)}, −1) + λ_k(y_k; u_k, u_{k−1}, ..., u_{k−(ℒ−2)}, −1),
           M_{k−1}(u_{k−1}, u_{k−2}, ..., u_{k−(ℒ−2)}, +1) + λ_k(y_k; u_k, u_{k−1}, ..., u_{k−(ℒ−2)}, +1) ]
                                                                 u_k = −1, +1          (4.9.12)

If the upper branch of (4.9.12) is the greater, the resulting path memory symbol is û_{k−(ℒ−1)} = −1 for the given state; while if the lower is greater, û_{k−(ℒ−1)} = +1 for this state.
^{18} This generalizes trivially for Q-level data input sequences to a Q^{ℒ−1} state trellis.
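A sketch of this trellis search, using the branch metric (4.9.10) and the recursion (4.9.12), follows (not from the text; function and variable names are illustrative, and the observables y_k and coefficients h_0, ..., h_{ℒ−1} are assumed given):

```python
# Hedged sketch (not from the text): the Viterbi algorithm of (4.9.12) applied
# to maximum likelihood demodulation of binary (+1/-1) data over an ISI channel
# of memory L, using the branch metric of (4.9.10) up to the common 1/N0 factor.

from itertools import product

def mlse_equalize(y, h):
    """y: matched-filter observables y_k; h: ISI coefficients [h_0, ..., h_{L-1}]."""
    L = len(h)
    states = list(product((-1, +1), repeat=L - 1))   # (u_{k-1}, ..., u_{k-(L-1)})
    metrics = {s: 0.0 for s in states}
    paths = {s: [] for s in states}
    for yk in y:
        new_metrics, new_paths = {}, {}
        for state in states:
            for uk in (-1, +1):
                # branch metric (4.9.10): u_k(2 y_k - u_k h_0 - 2 sum u_{k-i} h_i)
                isi = sum(h[i] * state[i - 1] for i in range(1, L))
                lam = uk * (2.0 * yk - uk * h[0] - 2.0 * isi)
                nxt = (uk,) + state[:-1]
                m = metrics[state] + lam
                if nxt not in new_metrics or m > new_metrics[nxt]:
                    new_metrics[nxt] = m
                    new_paths[nxt] = paths[state] + [uk]
        metrics, paths = new_metrics, new_paths
    best = max(metrics, key=metrics.get)
    return paths[best]
```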
We proceed now to evaluate the performance of this maximum likelihood
decision, based on the fact that it is implementable using the VA. Given the
correct path due to message m and another path through the trellis corresponding
to message m′ (or, equivalently, the state diagram), with corresponding metrics^{19} Λ and Λ′ for correct and incorrect paths respectively, an error will occur if Λ′ > Λ. The probability that this incorrect path causes an error is given by (2.3.10)

P_E(m → m′) = Pr {Λ′ > Λ} = Q( ‖x − x′‖ / √(2N_0) )          (4.9.13)

where from (4.9.2) and (4.9.7) we get

‖x − x′‖² = ∫_{−∞}^{∞} [x(t) − x′(t)]² dt
          = ∫_{−∞}^{∞} [ Σ_{k=−N}^{N−1} (u_k − u_k′) h(t − kT) ]² dt
          = Σ_{k=−N}^{N−1} Σ_{j=−N}^{N−1} (u_k − u_k′)(u_j − u_j′) ∫_{−∞}^{∞} h(t − kT)h(t − jT) dt
          = Σ_{k=−N}^{N−1} Σ_{j=−N}^{N−1} (u_k − u_k′)(u_j − u_j′) h_{k−j}          (4.9.14)
Again note that x is a vector of coefficients of the Gram-Schmidt orthonormal
expansion of the signal x(t). Defining error signals

ε_k = ½(u_k − u_k′) = {  1    u_k = 1, u_k′ = −1
                         0    u_k = u_k′
                        −1    u_k = −1, u_k′ = 1          (4.9.15)

and noting that h_{k−j} = h_{j−k}, we rewrite (4.9.14) as

‖x − x′‖² = 4 Σ_{k=−N}^{N−1} Σ_{j=−N}^{N−1} ε_k ε_j h_{k−j}
          = 4 Σ_{k=−N}^{N−1} ( ε_k² h_0 + 2 ε_k Σ_{j=−N}^{k−1} ε_j h_{k−j} )
          = 4 Σ_{k=−N}^{N−1} ( ε_k² h_0 + 2 ε_k Σ_{i=1}^{k+N} ε_{k−i} h_i )
          = 4 Σ_{k=−N}^{N−1} ( ε_k² h_0 + 2 ε_k Σ_{i=1}^{ℒ−1} ε_{k−i} h_i )          (4.9.16)
^{19} In what follows, we avoid first subscripts m and m′ and use instead only a superscript prime to distinguish the incorrect path from the correct path.
Thus for a given error sequence ε = (ε_{−N}, ε_{−N+1}, ..., ε_0, ..., ε_{N−1}), we have the probability of error given by

P_2(ε) = Pr (Λ′ > Λ) = Q( [ (2/N_0) Σ_{k=−N}^{N−1} ( ε_k² h_0 + 2 ε_k Σ_{i=1}^{ℒ−1} ε_{k−i} h_i ) ]^{1/2} )          (4.9.17)

Using the bound Q(x) ≤ e^{−x²/2} first used in Sec. 2.3, we have

P_2(ε) ≤ exp [ −(1/N_0) Σ_{k=−N}^{N−1} ( ε_k² h_0 + 2 ε_k Σ_{i=1}^{ℒ−1} ε_{k−i} h_i ) ]
       = Π_{k=−N}^{N−1} exp [ −(1/N_0) ( ε_k² h_0 + 2 ε_k Σ_{i=1}^{ℒ−1} ε_{k−i} h_i ) ]          (4.9.18)
We use the notation P_2(ε) to indicate that the error probability depends on the
differences between the data symbols along the two paths. Note further that the
result depends on the sign of the differences and their locations. That is, unlike
the error probability for linear codes over binary-input, output-symmetric chan-
nels, where performance does not depend on the sign of the channel input and
hence the uniform error property holds, the situation is complicated here by the
fact that the sign of the errors and hence of the data symbols must be accounted
for. Of course, either sign is equally likely for each data symbol, and hence the
error components for each pair of (correct and incorrect) paths can take on values
0, + 1, and — 1.
Up to this point, we have examined the transmitted binary sequence u and any other sequence u′. Defining the error sequence

ε = ½(u − u′)          (4.9.19)

we defined the error probability P_2(ε) which is bounded by the expression (4.9.18). We now focus our attention on only those sequences u and u′ that can cause an error event as shown in Fig. 4.11. Equivalently, we restrict error sequences {ε} to those that begin at some fixed time and have no consecutive ℒ − 1 zeros until after the last nonzero component. Define the number of nonzero components of ε as w(ε), and note that ε uniquely specifies the source sequence u in exactly w(ε) places. Over the ensemble of all equally likely source sequences, the probability that a source sequence can have the error sequence ε is 2^{−w(ε)}, and thus the probability of an error event occurring at a given node is union bounded by

P_E ≤ Σ_ε (1/2^{w(ε)}) P_2(ε)          (4.9.20)
where the summation is over all error sequences starting at a given node and
terminating when the two paths merge. This probability includes averaging over
all transmitted sequences. To determine the bit error probability at any given
node, we observe that, for any pair of sequences u and u′ with the resulting error
sequence ε, the number of bit errors associated with this error event is w(ε). Hence

P_b ≤ Σ_ε (w(ε)/2^{w(ε)}) P_2(ε)          (4.9.21)
Using the bound (4.9.18), the bounds on probabilities (4.9.20) and (4.9.21) can be
bounded further as
P_E ≤ Σ_ε Π_{k=−N}^{N−1} (1/2^{|ε_k|}) exp [ −(1/N_0) ( ε_k² h_0 + 2 ε_k Σ_{i=1}^{ℒ−1} ε_{k−i} h_i ) ]          (4.9.22)

and

P_b ≤ Σ_ε w(ε) Π_{k=−N}^{N−1} (1/2^{|ε_k|}) exp [ −(1/N_0) ( ε_k² h_0 + 2 ε_k Σ_{i=1}^{ℒ−1} ε_{k−i} h_i ) ]          (4.9.23)
The evaluation of (4.9.22) and (4.9.23) is facilitated by the use of the error-state diagram. The dimensionality of the diagram is 3^{ℒ−1}, since each pair of paths at a given node will differ in each state component by 0, +1, or −1, with +1 and −1 being equally likely. The all-zero error state is the initial and final state, as usual; Figs. 4.22 and 4.23 illustrate the error-state diagram for ℒ = 2 and ℒ = 3. Note that the weighting factors of (4.9.22) are accounted for by preceding the branch transfer function by a factor of ½ if the transition involves a discrepancy (error) between states. If the bit error probability is desired, it is for exactly these transitions that a bit error is made. Hence the factor I should also be inserted on these branches. P_b is then obtained by differentiating the generating function with respect to I and setting I = 1, just as in Sec. 4.4 for convolutional codes.
For the case ℒ = 2 of Fig. 4.22, in the complete error-state diagram of Fig. 4.22d, the +1 and −1 states are equivalent^{20} and can be combined into a single state resulting in the simpler error-state diagram of Fig. 4.22e. It follows directly from this that

T(a_0, a_1, a_2; I) = a_0 I / [1 − (a_1 + a_2)I/2]

∂T(a_0, a_1, a_2; I)/∂I |_{I=1} = a_0 / [1 − (a_1 + a_2)/2]²

and

P_b ≤ e^{−h_0/N_0} / { 1 − e^{−h_0/N_0} [e^{−2h_1/N_0} + e^{2h_1/N_0}]/2 }²          (4.9.24)
This result is of particular interest since it applies to duobinary transmission,
^{20} There is no way to distinguish these states by observing branch values.
[Figure: (a) digital equivalent model for ℒ = 2; (b) branch metric generator; (c) branch metric generator for error-state diagram; (d) error-state diagram for bit error computation, with branch labels a_0 = e^{−h_0/N_0}, a_1 = e^{−(h_0+2h_1)/N_0}, a_2 = e^{−(h_0−2h_1)/N_0}; (e) reduced error-state diagram and bit error bound, T(a_0, a_1, a_2; I) = a_0 I/[1 − (a_1 + a_2)I/2] and P_b ≤ a_0/[1 − (a_1 + a_2)/2]².]
Figure 4.22 ISI channel example for ℒ = 2.
where each transmitted signal is made a pulse of double width, that is

h(t) = { √(ℰ/2T)    0 ≤ t < 2T
         0           otherwise

For this case, it is easily verified from (4.9.7) that

h_0 = ℰ
h_1 = ℰ/2
h_i = 0    for all i ≥ 2

Thus

P_b ≤ e^{−ℰ/N_0} / { 1 − e^{−ℰ/N_0} [e^{ℰ/N_0} + e^{−ℰ/N_0}]/2 }² = 4e^{−ℰ/N_0} / (1 − e^{−2ℰ/N_0})²
Of particular importance is the fact that the asymptotic exponent is not degraded
relative to the case without intersymbol interference.
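A quick numerical evaluation of this duobinary bound (as reconstructed above; the ℰ/N_0 values are illustrative) shows that only the multiplying factor, and not the exponent, is affected:

```python
# Hedged sketch (not from the text): numerically evaluating the duobinary bit
# error bound above, P_b <= 4 exp(-E/N0) / (1 - exp(-2E/N0))^2, to illustrate
# that only the multiplying factor, not the exponent exp(-E/N0), is affected
# by the intersymbol interference.

import math

for snr_db in (2, 4, 6, 8, 10):
    x = 10 ** (snr_db / 10.0)                 # E/N0 as a ratio
    bound = 4 * math.exp(-x) / (1 - math.exp(-2 * x)) ** 2
    factor = bound / math.exp(-x)             # approaches 4 as E/N0 grows
    print(f"E/N0 = {snr_db:2d} dB   P_b bound = {bound:.3e}   factor = {factor:.3f}")
```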
The error-state diagram shown in Fig. 4.23d for the ℒ = 3 case has four equivalent pairs of states which can be combined to form the simpler reduced error-state diagram shown in Fig. 4.23e. This type of reduction always occurs, so that in general the error-state diagram for intersymbol interference of memory length ℒ has a reduced error-state diagram of (3^{ℒ−1} − 1)/2 nonzero states.
[Figure: (a) digital equivalent model with correlated Gaussian noise n_k; (b) branch metric generator; (c) branch metric generator for error-state diagram; (d) error-state diagram for bit error computation, with branch labels of the form e^{−h_0/N_0}, e^{−(h_0 ± 2h_1)/N_0}, e^{−(h_0 ± 2h_2)/N_0}, and e^{−(h_0 ± 2h_1 ± 2h_2)/N_0}; (e) reduced error-state diagram.]
Figure 4.23 ISI channel example for ℒ = 3.
We have treated here the case of a known modulator/channel filter followed
by additive white Gaussian noise. The output of the filter is modeled as some
linear weighting of a finite number of data symbols, and it can be viewed as a “ real
number convolutional code” which we decode with the Viterbi algorithm. Except
for the error-state diagrams, there is nothing special about linear intersymbol
interference channels of this type. Just as the Viterbi algorithm is a maximum
likelihood decoding algorithm for arbitrary nonlinear trellis codes, it can also be
applied as a maximum likelihood demodulator for data sequences that enter any
channel which consists of an arbitrary (possibly nonlinear) but noiseless finite
memory part followed by a memoryless noisy part. The noiseless finite memory
part of the channel acts like a trellis code which the Viterbi algorithm demodula-
tor decodes in the presence of additive memoryless noise.
4.10 CODING FOR INTERSYMBOL INTERFERENCE
CHANNELS*
Considering the commonality in structure of the optimum decoder for convolu-
tional codes and the optimum demodulator for intersymbol interference channels,
it is reasonable to expect that a combined demodulator—decoder for coded trans-
mission over intersymbol interference channels would have the same structure as
each component. This is readily shown by examining the intersymbol interference
channel model of Fig. 4.21 preceded by a convolutional encoder, and the
modifications this produces in Eqs. (4.9.2) through (4.9.10).
With coding, the channel output signal, prior to addition of the AWGN, is
x(t) = Σ_{k=−Nn}^{Nn−1} x_k h(t − kT)          (4.10.1)

where x_k is now the kth code symbol and hence, for a constraint length K convolutional code, it depends on K binary data symbols for a rate 1/n code, or on K b-dimensional binary data vectors for a rate b/n code. Thus the 2Nn code symbols {x_k} are generated from the 2N data vectors u_j by the expression

x_k = ψ_{k−n⌊k/n⌋}(u_{⌊k/n⌋}, u_{⌊k/n⌋−1}, ..., u_{⌊k/n⌋−(K−1)})    −Nn ≤ k ≤ Nn − 1          (4.10.2)
where ⌊v⌋ is the greatest integer not greater than v and u_i = 0 for i < −N. For a rate 1/n code, the data vectors u_j become binary scalars and the ψ_j function is the scalar projection of the vector formed from the terms of the jth tap sequence of the code (g_{0j}, g_{1j}, ..., g_{K−1,j} in Fig. 4.1). For a rate b/n code, the u_j are b-dimensional
binary vectors and the ψ_j function is the corresponding matrix operation on
the data matrix (e.g., in Fig. 4.2b and c, this matrix is formed from the jth rows of
the matrices gy and g,).
Upon replacing (4.9.2) by (4.10.1) and (4.10.2), the remainder of the derivation
* May be omitted without loss of continuity.
of Sec. 4.9 proceeds as before with u_{mk} replaced by x_{mk}. Thus (4.9.10) becomes, upon dropping the first subscript m for notational simplification,

N_0 λ_k(y_k; x_k, x_{k−1}, ..., x_{k−(ℒ−1)}) = x_k ( 2y_k − x_k h_0 − 2 Σ_{i=1}^{ℒ−1} x_{k−i} h_i )          (4.10.3)

But from (4.10.2), we have that x_k depends on

u_{⌊k/n⌋}, u_{⌊k/n⌋−1}, ..., u_{⌊k/n⌋−(K−1)}

and hence similarly, for i = 1, 2, ..., ℒ − 1, x_{k−i} depends on

u_{⌊(k−i)/n⌋}, u_{⌊(k−i)/n⌋−1}, ..., u_{⌊(k−i)/n⌋−(K−1)}

Thus the kth branch metric of (4.10.3) can be written as the function of the ⌈(ℒ − 1)/n⌉ + K data vectors

u_{⌊k/n⌋}, u_{⌊k/n⌋−1}, ..., u_{⌊(k−(ℒ−1))/n⌋−(K−1)}

(where ⌈v⌉ denotes the least integer not less than v) by substituting (4.10.2) with the appropriate index for each term x_k and x_{k−i} in (4.10.3).
Thus, maximizing (4.10.3) over all possible data paths {u_j} is exactly the same problem as maximizing (4.9.10) for uncoded intersymbol interference channels, or (4.9.11) for coded channels without interference. The only differences are that, for those cases, the state vectors are of dimensions ℒ − 1 and K − 1, respectively, while here their dimension is ⌈(ℒ − 1)/n⌉ + (K − 1), and the functions which define the branch metrics are somewhat more elaborate, being a composite of the previous two.
However, once the branch metrics are formed, the maximizing algorithm is again the VA, exactly as before. Thus, the algorithm is again expressed by (4.9.12) but with ℒ − 1 replaced by ⌈(ℒ − 1)/n⌉ + (K − 1). We conclude thus that the
optimum demodulator—decoder for coded intersymbol interference channels is no
more complex, other than for dimensionality, than the corresponding uncoded
channel.
Unfortunately, however, the calculation of error probability is greatly com-
plicated here, even though the error probability development in Sec. 4.9, begin-
ning with (4.9.13) and leading to the expression (4.9.18) for P_2(ε), can proceed with u_k and u_k′ replaced by x_k and x_k′, respectively, and ε_k = ½(u_k − u_k′) of (4.9.15) through (4.9.18) replaced by

ε_k = ½(x_k − x_k′)          (4.10.4)
But the difficulty arises when we attempt to average over all possible error events
as in (4.9.20). For, in the uncoded case, all error sequences ε are possible and, given that the data symbol is in error (ε_k ≠ 0), it is equally likely to be +1 or −1. This is not the case for coded transmission. First of all, not all error sequences ε
are possible; for if the correct and incorrect path code symbols are identical in the kth position, then ε_k = 0,
a condition dictated by the code. To make matters worse, if ε_k ≠ 0, it is not
necessarily equally likely to be +1 or —1; this too depends on the code.
In principle, an expression similar to (4.9.20) can be written in the form

P_E ≤ Σ_{all correct paths} Σ_{all incorrect paths} f(x, x′) P_2(ε = ½(x − x′))          (4.10.5)
where f(x, x’), the distribution function, is dictated by the code and by the fact
that all data sequences u are equally probable. While this calculation can be
carried out with considerable effort in a few very simple cases (Acampora [1976]),
it provides little insight into the general problem. Using ensemble average
techniques to be developed in Chap. 5, we shall obtain in Sec. 5.8 some rather
general and revealing results on the effect of intersymbol interference on the
ensemble of time-varying convolutional codes.
Before concluding, however, it is worth noting that some simplification is possible in the branch metric expression whenever n ≥ ℒ − 1; for the total memory then becomes

⌈(ℒ − 1)/n⌉ + (K − 1) = K    n ≥ ℒ − 1          (4.10.6)

and the branch metric function given by (4.10.3) with (4.10.2) can be expressed functionally as

N_0 λ_k(y_k; u_{⌊k/n⌋}, u_{⌊k/n⌋−1}, ..., u_{⌊k/n⌋−K})          (4.10.7)
Now, while it would appear that the condition n ≥ ℒ − 1 is overly restrictive, this is not at all the case. For suppose ℒ − 1 = 3, and the code rate b/n = 1/2; without
any change in the code implementation, we may treat it as if it were a rate
b/n = 2/4 code and thus achieve the desired condition for (4.10.6). Of course, the
data vectors are now two-dimensional rather than scalar, but all code representa-
tions from shift register implementation to state diagram can be redrawn in this
way without changing the code symbols generated and consequently the perfor-
mance in any way. For the code itself (not considering the intersymbol interfer-
ence channel), the state vector dimensionality is the same as before but the
connectivity of the state diagram increases; yet the generating function does not
change in any way (see Prob. 4.26), and thus it is clear that all that has changed is
the representation.
4.11 BIBLIOGRAPHICAL NOTES AND REFERENCES
The concept of convolutional codes was first advanced by Elias [1955]. The first
important decoding algorithm, known as sequential decoding, was introduced by
Wozencraft [1957] and refined by Reiffen [1960]. This material, and the later more
efficient algorithm due to Fano [1963], led to an important class of decoding
techniques which will be treated in Chap. 6. The material in this chapter, while
chronologically subsequent to these early developments, is more fundamental
and, for tutorial purposes, logically precedes the presentation of sequential decod-
ing algorithms.
Sections 4.2 through 4.6 follow primarily from three papers by Viterbi [1967a],
[1967b], and [1971]. The last is a tutorial exposition which contains most of the
approach of Secs. 4.2 through 4.5 and 4.7. The material in Sec. 4.6 appeared in the
second of the above papers. The so-called Viterbi algorithm was originally pre-
sented in the first paper as “a new probabilistic nonsequential decoding algor-
ithm.” The tutorial exposition [1971] appeared after the availability in preliminary
form of some most enlightening clarifications by Forney (which later appeared in
final form in Forney [19725], [1973], and [1974]); to this work, we owe the concept
of the trellis exposition of the decoder. In the same work, Forney also recognized
the fact that the VA was a maximum likelihood decoding algorithm for trellis
codes. Omura [1969] first observed that the VA could be derived from dynamic
programming principles.
The state diagram approach as a compressed trellis and the generating func-
tion analysis first appeared in Viterbi [1971]. The first code simulation which led
to the recognition of the practical value of the decoding algorithm was performed
by Heller [1968]. The code search leading to Table 4.1 was performed by Oden-
walder [1970].
Feedback decoding traces its conceptual roots to the threshold decoder of
Massey [1963]. Important code search results which revealed properties of convo-
lutional codes necessary for feedback decoding appeared in Bussgang [1965]. The
exposition in Sec. 4.8 follows primarily from the work of Heller [1975].
The important realization that maximum likelihood demodulation for
intersymbol interference channels can be performed using the VA is due to Forney
[1972a]. The development of Sec. 4.9 follows this work conceptually, although the
derivation is basically that of Acampora [1976]. Section 4.10 on combined demod-
ulation and decoding for intersymbol interference channels follows the work of
Omura [1971], Mackechnie [1973], and Acampora [1976].
PROBLEMS
4.1 Draw the code tree, trellis, and state diagram for the K = 2, r = 4 code generated by
Data > : Code
——----4
fi
mA Figure P4.1
288 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
4.2 Draw the trellis and state diagram for the K = 3, r = 3 code generated by
Data Code
Figure P4.2
4.3 Draw the state diagram for the K = 4, r = 4 code generated by
Data Code
Figure P4.3
4.4 Draw the code tree, trellis, and state diagram for the K = 2, r = 3 code of Fig. 4.2c.
4.5 Given the K = 3, r=4 code of Fig. 4.4 of the text, suppose the code is used on a BSC and the
received sequence for the first eight branches is
00 01 10 00 00 00 10 01
Trace the decisions on a trellis diagram labeling the survivor’s Hamming distance metric at each node
level. If a tie occurs in the metrics required for a decision, always choose the upper (lower) path.
4.6 (a) Solve for the generating function (in D only) of the labeled state diagram of Fig. 4.6 of the text
and show that the minimum distance between paths is 3.
(b) Repeat for the K = 2 code of Prob. 4.1 and show that the minimum distance between paths
is 3.
(c) Repeat for the K = 4 code of Prob. 4.3 and show that the minimum distance between paths
is 6.
4.7 Determine the node error probability bounds and the bit error probability bounds for all codes of
Prob. 4.6 for a binary-input, output-symmetric channel for which z is known.
4.8 Verify inequality (4.5.5) of the text.
4.9 It is of interest to determine the maximum value of the free distance for any fixed or time-invariant
code of a given constraint length and rate. The following sequence of steps leads to an upper bound,
nearly achievable for low K, for rate 1/n codes. Consider the rate 1/n fixed convolutional code whose
generator matrix is given by (4.1.1) with all rows shifted versions of the first row.
(a) Show that for any binary linear code, if we array the code in a matrix, each of whose rows is a
code vector, any column has either all zeros or half zeros and half ones.
(b) Consider the set of all finite-length data sequences of length no greater than k. Show that the
CONVOLUTIONAL CODES 289
code generated by these finite-length data sequences has length (K — 1 + k) branches, or (K — 1+k)n
symbols, and show that the average weight (number of “ 1”s) of all codewords (excluding the all-zeros)
is no greater than
2-'(K —1+k)n
2*-1
w,y(k) <
(c) Using (b) show that the code has minimum distance between paths (free distance) d, < w,,(k)
for any k.
(d) Let k vary over all possible integers and thus show that (Heller [1968])
1K —1+k
d, <min ol a
That is, for small K and n= 2, r = 4 this yields
Upper bound on d,,.. Achievable
K (integer) (noncatastrophic)
2 4+ 3
3 5 5
4 6 6
5 8t 7
+ Achievable with catastrophic code.
4.10 (Van de Meeberg [1974]) For a BSC where Z = J 4p(1 — p) show that (4.4.4) can be replaced by
oa “a when d is odd
This can be shown by examining the decision region boundary and the Bhattacharyya bound on
decoding error when d = 2t and when d = 2t — 1. Using this show that
P.(j) < 2[(1 + Z)T(Z) + (1 — Z)T(—Z)]
can replace (4.4.7) when we have a BSC.
Hint: First show that for two codewords at odd distance d, the decoding error probability
(maximum likelihood) is the same as for two codewords at distance d + 1.
4.11 Consider a rate 1/n fixed binary convolutional code and define the code generator polynomials
gi(Z) =1+ 91,i2 + ra sd 5 oe MOE A aac i= 1, yA meee, |
Show that this convolutional code is catastrophic if and only if all the n generator polynomials have a
common polynomial factor of degree at least one. Use the fact that for a catastrophic code some
infinite-weight information sequence will result in a finite-weight code sequence.
4.12 Use the result of Prob. 4.11 to show that, for rate 1/n fixed binary convolutional codes the relative
fraction of catastrophic codes in the ensemble of all convolutional codes of a given constraint length is
1/(2" — 1), which is independent of constraint length.
4.13 For any integer n, find a rate (n — 1)/n single-error—correcting code and its feedback decoding
implementation, which is a generalization of Figs. 4.17 and 4.18.
4.14 Consider a rate 1/n convolutional code with constraint length K. Let a(/) be the number of paths
that diverge from the all-zeros path at node j and remerge for the first time at node j + K + l.
290 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
(a) Show that
1 1=0
7: isi<ck
a(l) =
a>
(b) Directly from (a) prove that
T,(L) = ¥ affe**'
=0
_ IK(1-L)
—1-2L4+ 45K
(c) Noting that a(l) is the number of binary sequences of length | — 1 which do not have K — 1
consecutive “0”s, show that
2-1 > a(l) > 2-41 — pk
Hence when /2-‘*~") < 1, we have the approximation
all) = 2'-*
For most codes with large constraint lengths K, this approximation would be valid for all values of /
that contribute significantly to error probabilities.
4.15 Given a K = 3, r= 4 binary convolutional code with the partially completed state diagram
below, find the complete state diagram and sketch a diagram for this encoder.
Figure P4.15
4.16 Suppose the K = 3, r = 3 convolutional code given in Prob. 4.2 is used over a BSC. Assume
that the initial code state is the (00) state. At the output of the BSC, we receive the sequence
y = (101 010 100 110 011 rest all “0”)
Find the maximum likelihood path through the trellis and give the decoded information bits. In case of
a tie between any two merged paths, choose the upper branch coming into the particular state.
CONVOLUTIONAL CODES 291
4.17 In Fig. 4.10 let €,, €., and €, be dummy variables for the partial paths to the intermediate nodes.
Let
os
=] ¢.
Ca
and write state equations of the form
§=AG+b
Find A, a 3 x 3 matrix, and vector b. & can be found by
& =(I-— A) "'b
where I is the 3 x 3 identity matrix. Solve this to find T(D, L, I) and check your answer with (4.3.3).
4.18 In Prob. 4.17 consider the expansion
(I— A) '=1+A+A?+A°+A*4+--
(a) Use the Cayley-Hamilton theorem to show that for L = 1 and J = 1
A> = DA?+DA
Hint: The Cayley-Hamilton theorem states that a matrix A satisfies its characteristic equation
p(a) = |A — Al| =0
(b) Use (a) to find (I — A)~' by the above expansion, and then find T(D) for Prob. 4.17.
(c) Show that terms in A* decrease at least as fast as D*’?.
(d) Repeat (a) and (b) for arbitrary L and J.
4.19 Given the K = 3, r = 4 code of Fig. 4.4 of the text, suppose the code is used on a BSC and the
transmitted sequence for the first eight branches is
14.01 81 8G. £9 43 80
Suppose this sequence is received error free, but somehow the first bit is lost and the received sequence
is incorrectly synchronized giving the assumed received sequence
10..3C O° 01 U2 te
This is the same sequence except the first bit is missing and there is now incorrect bit synchronization
at the decoder. Trace the decisions on a trellis diagram, labeling the survivors Hamming distance
metric at each node.
Note that, when there is incorrect synchronization, path metrics tend to remain relatively close
together. This fact can be used to detect incorrect bit synchronization.
4.20 Suppose that the Viterbi decoder uses only integer branch metrics j€ {+i,, +i,, ..., +iy,/2},
where J is even, giving rise to a channel with input 0 and 1, transition probabilities Po(j) with
Po(j) => Po(—J) for j > 0, and P,(j) = Po(—j). Let
+ iz/2 N
ny)" Pozi” and © “AG)= "Faz,
$2 Sta3 k=-N
and define
ae,
{A(z)}}_ = 1/2ag9 + Y¥ a
292 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
(a) Show that the pairwise error probability for an incorrect path at Hamming distance d from
the correct path upon remerging is exactly
Pa = {{(z)}*}-
(b) If the code-generating function is T(D, I) and
fi
dl I=1 d=dy
show that
(c) In (a) show that
where Z = min I(z)
and using (b) show that
dT
p, - 1 4T(D,1)
ay dl 1=1, D=Z
(d) For the BSC, show that Z = Z.
4.21 For the DMC with input alphabet 2, output alphabet ¥, and transition probabilities
{p(y|x): y € Y, x © ¥}, define the Bhattacharyya distance between any two inputs x, x’ € ¥ as
d(x, x’) = —In Y \/p(y|x)p(y|x’)
For two sequences x, x’ € 2% define the Bhattacharyya distance as
Show that, for any two diverging and remerging paths of a trellis whose Bhattacharyya distance is d,
(4.4.3) generalizes to
) en En
and (4.4.5) generalizes to
Pj) < ) a(dje*
d
where a(d) is the number of paths of Bhattacharyya distance d from the transmitted path. What is
necessary to be able to define generating functions that generalize (4.4.7) and (4.4.13)?
4.22 Consider the r = 5 convolutional code of Prob. 4.2. Suppose each time an information bit enters
the register, the three code bits are used to transmit one of eight orthogonal signals over the white
Gaussian noise channel. At the output of the channel, a hard decision is made as to which one of the
eight signals was sent. This results in a DMC with transition probabilities
bp, Pex
a
COTS lot vee
CONVOLUTIONAL CODES 293
Following the suggested generalization of Prob. 4.21, find the generating function T(D, J) and give the bit
error bound of (4.4.13). Repeat this problem when the outputs of the channel are not forced to be
hard decisions.
4.23 Show that the bound in (4.6.15) can be made tighter by
IF(1 — L) 1 |
T,(L) = 1 SE
1 yd >
Bey eal St ee gee |
k=0
4.24 In Fig. 4.9 let
€,(D, t) = generating function for all paths that go from state
a to state x in exactly t branches.
x=b cd
Let
C4(D, t)
&(D, t) = | ¢.(D, t)
E4(D, t)
(a) Show that
E(D, t + 1) = A&(D, t)
and find A.
(b) Find
> &(D, t)
t=1
and show that
T(D) = [0 D? 0] &(D, t)
t=1
(c) Suppose we have a BSC with crossover-probability p and in the Viterbi decoder we truncate
path memory after t branches and make a decision by comparing all metrics of surviving paths
leaving the given node at the truncation point. Metrics are computed only for the t branches following
the truncation point. Show that the probability of a node error at node j is bounded by
; Se |
P{j, t) <[0 D? 0] p3 E(D, t) + [1 1 1J&(D, t)
where D = ./4p(1 — p). Give an interpretation of each term in the bound. Note that &(D, r) is easily
found recursively with initial condition
D?
ELD, I}=1 0
0
(d) Obtain a closed-form expression for the bound in (c) for t = 7.
Hint: Use the result of Prob. 4.18(b).
4.25 Consider a binary convolutional code, a BSC with parameter p, and a feedback decoder of
memory length L. In the usual state diagram of the convolutional code, label distances from the
all-zeros path and define
¢.(D, t) = generating function for all paths that initially leave the all-zero state and go to state x in
exactly t branches (return to the all-zeros state is allowed)
294 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
(a) Show that the probability that the decoded path leaves the correct path at node j, P,(j, L), is
bounded by
P.(j, L) < ) ¢,(D, L)
D=V/4p(1- p)
where the summation is over all states including the all-zeros state.
(b) Evaluate a closed-form expression for the bound in (a) for L = 6 and a code with the distance-
labeled state diagram below:
Figure P4.25
and use the Cayley-Hamilton theorem [see Prob. 4.18(b)].
4.26 Treat the K = 3, r = 4 code of Fig. 4.2a as if it were an r = 2 code; ie., define a branch as the
four symbols generated for every two data bits. Draw the corresponding state diagram and determine
T(D, I). From this, compute the upper bound on P, using (4.4.13) and verify that the result is the same
as computed directly from the original state diagram for the r = 5 code.
4.27 Show that the noise components of the matched filter output in Fig. 4.21 have covariance
N,
E[n,n,] = 5: i hy - j
Instead of the matched filter, assume that the suboptimum “ integrate and dump ” filter is used. That is,
assume that the observables are
jn= | ylt)p(t-kT) dt k= -N,-N4+1,...,N-1
=o
CONVOLUTIONAL CODES 295
where
Gxr<T
1
P(t)= { ./T
0 otherwise
Show that the maximum likelihood demodulator based on observables j_y, V_y44, ---» Vy—1 IS
realized with the Viterbi algorithm with the bit error bound analogous to (4.9.23) given by
a4 1 1 ft 8 Z
P,<) we) [] wa) ©XP I-55 3 fies) |
€ k=-N o\i=0
where
i; =| h(t — jT)p(t — kT) de
For ¥ = 2, give the state diagram, determine the transfer function, and find the generating function
for the bit error bound in terms of fy and hy.
4.28 (Whitened Matched Filter, Forney [1972a]) Consider the intersymbol interference example of
Fig. 4.22 where ¥ = 2. Suppose the matched filter outputs {y,} are followed by the following digital
filter with outputs {y,}.
So¥k +h Vea. =Ve
Figure P4.28
Here we choose f, and f, to satisfy
fo +fi = ho
Sofi =hy
The matched filter combined with this transversal filter is called a whitened matched filter.
(a) Show that the outputs {},} are given by
Ve =Souy +h ue. + %
where
an ee
(b) Show that the maximum likelihood demodulator based on observables {},} is realized with
the Viterbi algorithm, and give the error-state diagram for this case.
(c) Show that the bit error bound based on the error-state diagram in (b) is also given by (4.9.24).
(d) Generalize the above results to arbitrary Y. First define
> ghent | :
H(D)= ¥ h,Di
i=-(#-1)
296 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
and show that there exists a polynomial of degree Y — 1
MN od | :
AD= IP
i=0
such that
H(D) = f(D) f(D" *)
Show that the transversal filter with inputs {y,} and outputs {},} that satisfy the difference equation
Pager |
3 LV n+ eae a
i=0
result in outputs satisfying the form
where
eit N,
E[ii, i; = ry Oxj
J
(e) Describe the error-state diagram when using the whitened matched filter in (d) and derive the
bit error bound
Nt
1 se :
Py< we) TT seep XP - | d fier) |
€ N visti
(f) In (d), show that
[e.2)
v-1
oy (<2 +2) caéa-ahs]
n k=1
4.29 (a) For the rate r= 2~* orthogonal convolutional encoder shown, consider a noncoherent
demodulator on each branch with linear combining so that path metrics are formed as the sums of the
branch metrics z(m). Using techniques of Secs. 2.12 and 4.6, show that the probability of error caused
by an incorrect path merging after J unmerged branches is bounded by
J J
P= Pr| 24> b 3,
j=l j=1
J
< max |] Ele'~*)]
O<p<1 j=1
a
where
z= max sex |( 757 ba
= max exp 11 ———__ —
beens 1—p? 1+ p/N,
(b) From this, derive the bound on the bit error probability (Viterbi and Jacobs [1975]).
zk
P et Mee ee
ir ZF
a.
Y y
Select one of 2
orthogonal-frequency signals
n(t)
CONVOLUTIONAL CODES 297
K
eth Se sin| 7k + Kaye +9
an
(Gj-1)T<t<jT
x, (t) y(t)
2K noncoherent
demodulators
on each branch
z;(0)
Y
y ’
Viterbi decoder
iT
z;(m) = | f- y(T) exp
(-1)T
Figure P4.29
Ae
ni(m + sola
ix
P20 i. 5. 5 24-1 Ko integer
A Cage &
v
rea
2
&6.4, 47-1
1.30 (a) For the same noncoherent channel and demodulator—decoder as in Prob. 4.29, show that the
‘ate r = 4 quaternary code generated by the K = 5 encoder shown above yields
| ear
1 + Za(Z)
1 + Zb(Z)
where a(Z) and b(Z) are polynomials in Z with integer coefficients.
Figure P4.30
Select
one of four
orthogonal- ,-————~>
frequency
signals
298 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
(b) Generalize the results to a rate r = 2~* code of constraint length K = 2k + 1 and a 2¥-ary
orthogonal signaling alphabet where k is any integer. This has been called the class of semiorthogonal
convolutional encoders.
4.31 Suppose the K = 3, r= 4 convolutional code shown in Fig. 4.2a is used over the intersymbol
interference channel with # = 2 shown in Fig. 4.22a. Assume the whitened matched filter (see
Prob. 4.28) so that the discrete model for the system becomes
aie a3 a 7
0> +1 Mk
Pes eae | |
|
| Convolutional encoder Intersymbol interference
bs
Finite-state machine
Figure P4.31
Here, the @ in the convolutional encoder is a modulo-2 sum and u is the binary data sequence with
symbols from {0, 1}. The summer in the intersymbol interference is a real sum. For each binary symbol
into the convolutional encoder, two coded symbols from {—1, 1} enter the intersymbol interference
channel and there are two corresponding outputs of the channel.
(a) Regard the combined convolutional encoder and intersymbol interference as a single finite-
state machine with binary inputs {u,} and pairs of outputs from {z,}. Defining the state of the system as
the binary sequence (a, b, c) shown as the contents of the encoder register, sketch the state diagram for
the device with pairs of the outputs (z,, z,,,) on the branches from state to state.
(b) Suppose the transmitted data sequence is u = 0. Consider another data sequence u’ where
uj, = O,9- That is uy = 1 and u, = 0, k # 0. What is the pairwise error probability P,(u— u’)?
(c) Assuming transmitted data sequence u= 0, construct a state diagram which will give a
generating function with which we can bound P;(u) and P,(u). Express the generating function in terms
of vectors and matrices.
4.32 Consider a channel with memory #=2, input alphabet 2% = {0, 1}, output alphabet
Y = {a, b, c, d} which consists of a noiseless memory part followed by a DMC as shown.
ae re Ree ens
Xe 7A €
ke re DMC vKeey
Tite, hea? PC y\|z)
-----++---
Channel Figure P4.32
CONVOLUTIONAL CODES 299
Here
a x, = 0, x,-, =0
— we b x, =0,x,-, =1
2, =f (xX,. %-1) = c x, =1,x,., =90
d x, =1,x,-, =1
F=9={ab,c,d
and
a b eS d
al q° pq Pp pq
{P(y|z)}=b] pq q°> pq P e=*- 2-0
c] p> pq q pq
dj pq p> pq q
(a) Assume x, = 0 and equiprobable binary data symbols x,(k = 1, 2, ...) are sent over the
channel. For the channel output sequence
hs Merk y2=5 ¥sa 4c yg=b y= a k>5
determine the maximum likelihood data sequence x,(k = 1, 2, ...).
n(n
2
(b) Determine the union-Bhattacharyya bound on P,(x), the bit error probability when x = 0 is
sent.
4.33 (Unknown Intersymbol Interference Channel) For the linear channel of memory ¥ and the
suboptimum “integrate and dump” filter discussed in Prob. 4.27, determine the performance degrada-
tion when the Viterbi demodulator is designed under the mistaken impression that the impulse-
. Pont -
response is At) rather than h(t), the true channel impulse-response. Here, define
Hint: Consider branch metric —
iy-;=| f(t —jT)p(t — kT) dt
and h,_; as in Prob. 4.27.
(a) Show that, if the demodulator is realized with the Viterbi algorithm designed for h(t), then the
bit error bound is given by
tm |
P,<> >’ wuw)[] =e 4"
uuu k=1 2
where )”,.4, is the summation over all incorrect sequences u’ which diverge from the correct sequence
u at some fixed initial time and remerge with it later, w(u, u’) is the weight of the error sequence
between u and u’,, and
ores 2 id ae ee
R, = ( >. h(u,—; — ud) +4 ry >: hh; is hj (uj —; a Uy, — Uy;
i=0 i=0 j=0
Assume that h(t) and A(t) are zero for t > YT.
(b) For evaluation of the bit error bound using a generating function, we need a state diagram
in which each state consists of a pair of states S and S’, where S is the correct state and S’ is an incorrect
state. Initial and final states are states in which S = S’. There are 2% ~' initial and 2*~' final states.
Introduce an initial dummy state and a final dummy state and note that there is probability 2~ “~ !) of
transition from the initial dummy state to each initial state. For Y = 2, consider the pair state diagram
as shown below and find all transitions and the transfer function. Note that this state diagram can be
reduced.
300 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Figure P4.33
(c) Show that the bit error bound is given by
eke . =) ae _ =) fo(2hy — ho) + h,(2h, — A
cosh | ———————— ] cosh | ——~———_ J exp { —
N, N, N,
ra = ~aA ~ = ~ a ~ Ds ~ ~ bes ~
[ — cosh (Hot il - 2 rot) exp ee ho(2ho — se h,(2h, — “oly
CHAPTER
FIVE
CONVOLUTIONAL CODE ENSEMBLE
PERFORMANCE
5.1 THE CHANNEL CODING THEOREM FOR TIME-VARYING
CONVOLUTIONAL CODES
This chapter treats for convolutional codes the same ensemble average error
bounds which were studied in Chap. 3 for block codes. However, useful tight
bounds can be found only for time-varying convolutional codes, corresponding to
the matrix (4.1.1) with g® and g*~” not necessarily equal. (For a fixed convolu-
tional code, each row is a shifted replica of every other row.)
For any convolutional code, we have from (4.4.1) and (4.4.2) that the node
error probability at the jth node of a maximum likelihood decoder employing the
Viterbi algorithm is bounded by
P(j) < Pr |) [AM(xj, x;) = 0]| anid < Pr{iAMG@,, xj2 0). 611)
lay ex) ] xj, €X'(j)
where x; is any incorrect path stemming from node j, 2"(j) is the set of all such
incorrect paths, x, is the correct path after node j, and AM(x;, x;) is the difference
between the metric increment of incorrect path x; and correct path x; over the
branches of their unmerged span.
For rate 1/n (binary-trellis) convolutional codes, we determined in Sec. 4.6 the
structure of all paths through the trellis. In particular, the bound (4.6.15) indicated
that for a constraint length K code, there are less than 2* paths which diverge from
the correct path at node j and remain unmerged for exactly K + k branches. This
conclusion can be arrived at alternatively by the following argument. Without loss
of generality, since a convolutional code is linear, we may take the all-zeros path
301
302 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
to be the correct path. Then any incorrect path which diverges from the correct
path at node j and remains unmerged for K + k branches must have binary data
symbols
by tg 3s Rie ds 6 Ma De
|-K—1-| (5.1.2)
where u;,4, ---, Uj+,—1 1S any binary vector containing no strings of more than
K — 2 consecutive zeros. While the exact number of such incorrect paths is best
computed by the generating function technique of Sec. 4.6, 2" is an obvious
upper bound on the number of such paths. We shall concentrate on rate 1/n codes
initially, and later generalize to rate b/n.
Now, as was done for block codes in Chap. 3, we average this error probabi-
lity bound over all possible codes in the ensemble. We begin by noting that each
term of the sum in the rightmost bound of (5.1.1) is a pairwise error probability
between the correct path x, and the incorrect path x; over the unmerged segment
of K + k branches, where k > 0. Using the Bhattacharyya bound (2.3.15) for each
such term, we have
Pr [AM(x;, x;) > 0] < ) \/pw(y|x;)pw(y |x;) (5.1.3)
where N = (K + k)nis the number of symbols on the unmerged segment of length
K +k branches and y is the received vector for this unmerged segment.
We must average over all possible values of x; and x; in the ensemble of
time-varying convolutional codes. Suppose, as for block codes, that the channel
input alphabet is Q-ary and that the time-varying convolutional code is generated
by the operation (Fig. 5.1)
tes > Ur, gi, B Vo, i
k=i-K+1 : (5.1.4)
x; = L(V;)
where g), g",..., g@_ , are time-varying binary connection vectors of dimension 1;
U;—-K+1> U;-K+2>-+--5 U; are binary data symbols; vy; is the ith binary branch vector
with | symbols (where / is a multiple of n) and vo ; is an arbitrary binary branch
vector with the same dimensionality as v;. Here vy ; plays the same role as in the
linear block code ensemble of Sec. 3.10, and is required for nonbinary and asym-
metric binary channels. Y(v;) is a memoryless mapping from sequences v; to
sequences x; of n Q-ary symbols (Q < 2’) (see Fig. 5.1 with b = 1).
The mapping ¥Y must be chosen carefully, particularly to ensure that the
ensemble over which averages will be taken is properly constructed. In the first
part of the derivation [through (5.1.9)], we shall deal with uniform weighting on the
ensemble, just as was done in the earlier part of Sec. 3.1. Then / and n must be
chosen so that 2' x Q", and each binary /-tuple input sequence should be mapped
into a unique Q-ary n-tuple output sequence. This can be achieved exactly if Q isa
power of 2, and otherwise approximated as closely as desired by choosing / and n
sjoquiAs
Aie-(Q u
x
sulddew
JoquwiAs
jouuerys
0}
joquiAs
Areulg
sjoquiAs
Areutq |
d1BO]
Ivoury
‘jauueyo yndui Aie-b uO 9pod u/q = 4 10} JapOOUD [BUOTNNJOAUOD J's andy
u/q=4
AAA 4
A
A
oO
A
A
ih
A A AA
roeEEs
A
A
co
a
303
304 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
sufficiently large. This results in approximately Q” possible Q-ary sequences with
uniform weighting if the original 2' binary sequences have uniform weighting.
We shall consider nonuniform weighting on the Q-ary sequence below.
Now let us consider the correct path and any incorrect path unmerged with it
from node j to node j + K + k. If we take the correct path to correspond to the
all-zeros data (without loss of generality), then its code sequence is Vo, ;, Vo, j+15
Vo, j+2>-++> Yo, j+K+k-1 and there are 2"**” possible binary sequences over the
(K + k)-branch unmerged span. After mapping this sequence onto the signal
vector X,, we have Q"***) possible Q-ary x sequences over this span. As for
the incorrect path in question, it must correspond to a data vector uj over
the unmerged span of the form of (5.1.2) where uj; --- uj4,— 1 Contains no strings
of more than K — 2 consecutive zeros. This implies then that each of the corres-
ponding branch vectors of the form v; is formed by the modulo-2 sum of vo ; and
at least one of the vectors g}), g{?, ..., gk 1 [see (5.1.4)]. Thus vj, Vj415---5Vj+K+n—1
can be any one of 2“**” possible binary sequences over the (K + k)-branch
unmerged span, and therefore x; can be any one of Q"**" Q-ary sequences,
independent of what the correct sequence x; may be. As a result, we may average
the bound (5.1.3) over the Q?" [where N = n(K + k)] possible correct and incor-
rect sequences as follows
tO
which is clearly independent of the node j.
Also, using the fact that the channel is memoryless and letting g(x) = 1/Q for
all x in the channel input alphabet, we obtain
[Ear]
— g~NRola) (5.1.5)
N
Pr [AM(x;, x;) > 0] <
where N = n(K + k) and
Ear T)) (5.1.6)
x
R,(q)=—-In>
¥
Finally, inserting (5.1.5) into the ensemble average of (5.1.1) and using 2* as the
upper bound on the number of incorrect paths x’ diverging from the correct path
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 305
at node j and remerging K + k branches later, we obtain a bound on the ensemble
average of the node error probability for each node, namely
oe
2 De- n(K +k)R,(q)
k=0
P.(j) <
4
e KnR(a)
Piz 2 — {nRo(a)/In 2A}
(5.1.7)
Since r = 1/n is the rate in bits per channel symbol, to define rate in nats per
symbol as for block codes, we let
R=rln2
= (In 2)/n nats/channel symbol (5.1.8)
and thus obtain
eee 2—KRoIR
Pj) <7—>--we | (0 < R<R,(q) (5.1.9)
where R,(q) is defined in (5.1.6).
Note also that, just as was done for block code ensemble averages in Sec. 3.1,
we may impose a nonuniform weighting q(x) on the channel input symbols. To
achieve this nonuniform weighting, we must choose the binary to Q-ary mapping
of Fig. 5.1 differently from that described after (5.1.4) for uniform weighting. Now
let 1 = nd and let each binary A-tuple be mapped into a Q-ary symbol. Further let
the mapping be chosen such that exactly r; of the 2* binary A-tuples map into the
Q-ary output symbol x;, where i = 1, 2, ..., Q, and
Q
Ds sk a
i=1
Thus by choosing A, and hence /, sufficiently large, any nonuniform distribution
can be approximated arbitrarily closely [by the distribution (r;2~*)] starting with
a uniform distribution on the binary /-tuples. Thus, (5.1.9) is valid even when R,(q)
is defined with an almost arbitrary nonuniform q(x).
The bit error probability, defined as the expected number of bit errors per bit
decoded, can be bounded by the same: argument used in Sec. 4.6 [preceding
(4.6.21)]. There we noted that an incorrect path which has been unmerged over
K +k branches can cause at most k + 1 bit errors, for, in order to merge with the
correct path, the last K — 1 data bits must coincide. Thus it follows, using (5.1.8),
that the ensemble average of the expected number of bit errors, caused by a node
error which begins at node j, is bounded by
E[n,(j)] 1< k + 1)2ke- mK +ORot@)
# AS KR ,(q)/R
Top eww = 0<R<RAQ) (5.1.10)
306 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Comparing R,(q) as defined in (5.1.6) with the Gallager function E,(p, q) as de-
fined by (3.1.18), we find
R,(q) = E,(1, 4) (5.1.11)
which is strictly less than capacity C for all physical channels, but may equal
capacity in certain degenerate cases as was shown in Sec. 3.2 [see (3.2.11)].
To extend our bounds for rates up to capacity, we must employ a more refined
argument than the simple union bound used so far. The technique, based on the
Gallager bound, is similar to that used in the latter half of Sec. 4.6. We begin by
considering the set of all incorrect paths diverging from the correct path at node j
and unmerged for exactly K + k branches, and take the sum over all k > 0. Thus
the node error probability at any node is bounded by
Peli) < YT) (5.1.12)
where
error caused by any one of up to 2* incorrect
1,(j) = Pr |
paths unmerged from node j to node j +k+K
This, then, is still a union bound, but over larger sets. For I1,(j) for a given code,
we can again use the Gallager bound (2.4.8)
TAG) <¥ pwtyle) Yay ye”
A
p
(5.1.13)
where N = n(K +k), 2(j), whose cardinality is no greater than 2*, is the set of all
incorrect paths diverging at node j and remerging K + k branches later and Xx; is
any member path of this set.
As before we note that x;, defined by (5.1.4) with u = 0, can be any one of Q”
possible sequences. However the set 2(j) is somewhat more restricted. For exam-
ple, suppose k = 2. Then there are just two compatible! paths in the set 2(j) whose
data sequences, between node j and node j + K + 2, are
LOLOO G8 and 111000-:--0
-K-—1-> +K—l1-
But obviously over the first branch, after diverging from the correct (all-zeros)
path, the two paths in question are still merged and hence their branch symbols x
are identical for this branch. Yet, even though the cardinality of (j) is limited,
any single path in this set can take on any one of Q™ code sequences, as can be
shown by exactly the same argument as before. However, when one path has been
chosen, all the others compatible with it are restricted in the choice of their code
symbols, to a lesser or greater extent depending on the span over which they are
' Compatible refers to those incorrect paths which are unmerged from the correct path in the given
number of branches.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 307
merged with already chosen paths. Let us then assign the weight qy(x;) to the
N = (K + k)n symbols of the correct path between nodes j and j + K + k; qy(x;)
equals 1/Q” if we use a uniform weighting. Also, we assign the weight
visa (8S. , Ax, 38°) where {20 >i =, 2y55.5 MM} 2G)
is the set of compatible incorrect paths. For uniform weighting, this weight will
just be the inverse of the number of distinct choices for the set of path sequences;
in general, gyy(-) has the property that its sum over all distinct possible members
of the set 2(j) equals unity. In fact, this notation allows us to augment the set #(/)
to include all Q’™ choices of the M vectors x$", ..., x", where M < 2*, whether
or not they are compatible based on the trellis structure just described, since any
inadmissible combination may be eliminated by assigning it zero weight.
Thus, averaging (5.1.13) over the ensemble with the weighting just defined, we
have
II,(J) 7 2 n(X;) > me > duu (XS, x tty x )TT,(J)
xj > am"
<i d awlxsPwly [xe Do DE ava (KP s KP?s --- KHPP)
y Xj s()) ”™
M p
x) ) [pwly |x)” O<p<l
(5.1.14)
Note that the summation on i is now unrestricted, since any inadmissible path
combinations are excluded by making qyy(-) zero for that choice of X{, x, ...,
x). Then, limiting p to the unit interval allows us to use the Jensen inequality
(App. 1B) to obtain
Th(/) < YY an(x,)pw(y |x)" *”
. se
M | F
x » : a9 as Gnu (X$”, X02), aa RO pa (y [ROG +p) ree
&;
Now, for the terms in braces, suppose we consider the ith term of the outer sum
and sum over all the internal summations except x‘. Since only qyy(-) depends
on these x‘ + x, we have
T1,(j) < >» Dd anx;)Pwly |x," *”
y xj
M p
x b > anl&P nly |RP)O4) — OO< p< (5.1.15)
Jj
The key observation to be made is that, as a result of this last step, we limit
consideration to a single incorrect path x‘. And, as was discussed previously, even
308 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
though the choices of the set of path sequences for the entire incorrect set %(j) is
limited by trellis constraints, the symbols for any single path may be freely chosen
among Q” possible sequences in the space 2. Thus the weighting qy(Xx‘) is the
same as qy(x;) for the correct path (both being 1/Q” if uniform weighting is
assumed). Hence the bound (5.1.15) can be written as?
T1,(/) <2" >. bs n(X)Pn(y ayer
= es
ep
= aly. b q(x)p(y|x)"/@ ae O<p<1 (5.1.16)
a
since M < 2*, the channel is memoryless and qy(x) is a product of N identical one-
dimensional weight functions. Since N = (K + k)n, this may be written as
T1,(j) < 2h e7 K+HnEolo. a) (5.1.17)
where as was first defined in Sec. 3.1
E,(p, = = an d
1+'p
2 q(x br xr O<p<1 (3.1.18)
Finally, substituting (5.1.17) in the ensemble average of (5.1.12) and using (5.1.8),
we obtain as our bound on the ensemble node error probability
rae a 8
k=0
< e KnEolp, 4) 3 Dke o— knEdp, a)
k=0
2 ~ KEolp, a)/R
ie 2 ~ tEole, q)/R]— p} ps E,(p. q)/R (5.1.18)
Similarly, the ensemble average of the expected number of bit errors caused by an
error at node j is obtained by weighting the kth term in (5.1.18) by (k + 1), since an
error caused by an incorrect path unmerged for K + k branches can cause no
more than k + 1 bit errors. Thus
E[n,(j)] ety a | (k + 1 )TT,(/)
k=0
re e KnEole, q) 3 (k 4. 12g ae q)
k=0
2- KE,(p, q)/R
a [1 = 27 oo R= ap? p <E,(p, q)/R (5.1.19)
2 Here we assume there are M = 2* such paths. Since this is larger than the actual number of
incorrect paths, this gives us a further upper bound on the error probability I1,(j).
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 309
There remains only the problem of choosing the parameter p and the best weight
distribution q. Note also that (5.1.18) and (5.1.19) reduce to (5.1.9) and (5.1.10),
respectively, for p = 1, as follows from the definition (5.1.11).
The function E,(p, q) was first studied in Sec. 3.2 and its basic properties,
summarized in Lemma 3.2.1, are that it is a positive increasing convex A function
for positive p, approaching 0 as p—0 with slope I(q) (see Fig. 3.1). Thus to
minimize the bounds for asymptotically large K, this suggests that we should
choose p as large as possible consistent with a positive exponent in the braces of
the denominator of (5.1.18) and (5.1.19). Such a choice would be, for small * « > 0
= se aa (5.1.20)
which reduces the bounds (5.1.18) and (5.1.19) to
a KER, q)/R
ea (5.1.21a)
a KER, q)/R
E[n,(j)] < [1 2a RP (5.1.21)
where the exponent E,(R, q) is established by the parametric equations
EAR, q) = E,(p, q) O<p<l
R area é)ay Rigo oR = Ne <) 5.122)
The construction of Fig. 5.2, based on the properties of E,(p, q), establishes that
the exponent E,(R, q) is positive and that the rate R increases continuously from
R = (1 — ©)E,(1, q) = (1 — €)R,(q)
to
R = (1 —€) lim [E,(, q)/p] = (1 — ©)I(q)
p70
as p decreases from 1 to 0. Recall also from Sec. 3.2 that
max I(q) = C
q
which is the channel capacity.
> Of course ¢€ is any positive number. Even though all our results are functions of ¢, exponents are
plotted for the limiting case of « = 0, for which they are maximized. Strictly, as « — 0, the multiplying
factor approaches oo, although only algebraically (not exponentially) in 1/e.
310 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Eo (pe, q)
|
t eee Figure 5.2 Construction for upper bound
0 p* 1 exponent (0 < p < 1).
Finally we may combine (5.1.9) and (5.1.10) with our present result, with
exponents maximized with respect.to the weight distribution q. This yields
4 — KEAR)/R
PAs s 1 p= EAR (5.1.23a)
7 — KEAR)/R
E{n,(j)] < [1 — 2- eaRvRp? (5.1.23b)
where
E.R) = R, = max E,(1,q) for0<R<R,(1-e) (5.1.24)
q
and
E(R) = max E,(p,q) O<p<1
q
for
R=(i— dimx le eae OR eR = Cae 195)
The composite exponent is plotted for a typical memoryless channel in
Fig. 5.3. Maximization of E,(p,q) with respect to the weight distribution
q = {q(x): x = ay, az, ..., dg} is performed exactly as in Sec. 3.2 (Theorem 3.2.2).
It is clear that, for asymptotically large K, « may be chosen asymptotically small.
E,(R)
R
fe)
|
|
|
|
|
|
|
|
|
R
0 RA ey CEL en Figure 5.3 E,(R) for typical memoryless channel.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 311
It remains to generalize this binary trellis (rate 1/n) coding result to trellises
with 2° branches* per node (rate b/n). Such encoders, shown in Fig. 5.1, require
effectively b shift registers, each of constraint length K, and the decoder storage
and computational complexity grows as 2°“~ ). For the present analysis, we need
only determine the form of the data sequences for all incorrect paths diverging at
node j and remerging with the correct path after an unmerged span of K + k
branches, where again, without loss of generality, we may take the correct path to
correspond to the all-zeros data sequence. For binary trellises, this was given by
(5.1.2). For 2°-ary trellises, this is generalized to the form
U;, Uj+1, Uj+2,---, Uj+,, 0,0,...,0
-K-1—> (5.1.26)
where all terms are b-dimensional binary vectors representing the b bits input to
the encoder register per branch. Now, u, and u,,, can be any of the 2” — 1 nonzero
b-dimensional binary vectors, since we require that the path diverge from the
all-zeros at node j and not remerge before j + k + K. And u,,, through u,,,_,
may each be any b-dimensional binary vector, the only limitation being that no
string of K — 1 or more consecutive 0 vectors may begin before the (j + k + 1)st
branch, for otherwise remerging with the correct path would occur before node
j+K+k. Thus there are less than (2? — 1)2™* possible incorrect paths in the
subset %(j) of incorrect paths which diverge at node j and remerge at node
j+K +k. Hence, all results obtained for rate 1/n trellis codes can be generalized
to rate b/n by replacing 2* with (2 — 1)2”* in expressions (5.1.7), (5.1.9), (4.6.16),
(5.1.12), and (5.1.16) through (5.1.19). It suffices to consider only the last two
expressions, which represent the most general case. Thus for rate b/n codes
< e KnEolp. q)
k
(2° ust HYD Aen. SWF
. [(2° *~ 1)2*[Pe7 Eee. @
=0
1 — 2 SWE, @)/RI— p} O0<p<E,(p,q)/R<1 (5.1.27)
where
R=rln2
= (b/n) In 2 nats/channel symbol (5.1.28)
* The mapping function for rate b/n codes is the same as for rate 1/n codes—see the description
following (5.1.4) and (5.1.9) for uniform and nonuniform weightings. We could even consider trellises
with B branches per node, where f is not a power of 2. However, this requires linear encoders with
input data in nonbinary form, a very impractical possibility. Also it requires that all linear operations
be performed over a finite field of B elements; hence B must be a prime or the power of a prime.
312 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Similarly generalizing (5.1.19) for rate b/n, recognizing that an erroneous branch
can cause up to b bit errors, we obtain
io. @)
E[n,(j)] < > b(k + 1) TL.)
k=0
oO
<:¢, ee Oy bk + 1 — 1 eg oe
k=0
b(2° ced 1) SP. q)/R
< (1 — 2 PUES, aki 032
0<p<E,p,q/R<1 (5.1.29)
Choosing p = 1 for
ReERIE-o) (5.1.30a)
and
pe ee ioe (5.1.30b)
for higher rates, we generalize (5.1.23) and (5.1.24) to rate b/n codes by replacing
E({R) by bE,{R) and multiplying both expressions by 2° — 1 and the second also
by b.
All our results thus far have been for events at a particular node level.
However, bit error probability is defined as the expected number of bit errors over
the total length of the code, normalized by the number of bits decoded. Thus for
an L-branch trellis code of rate b/n, since b bits are decoded per branch
<r: 2 E[n,(j)] (5.1.31)
inequality follows from the fact that bit error sequences may overlap, as discussed
in Sec. 4.4. Consequently, combining (5.1.29), (5.1.30), and (5.1.31) and optimizing
with respect to q, we obtain over the entire length of the code
LE[n,(j)]
ance
2 — KbEAR)/R
a be [1 — 2 = GEAR e>0 (5.1.32)
E(R)=R, O=< R= R{1—-€) (5.1.33)
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 313
Since this is an ensemble average over all possible trellis codes of length L
branches, we conclude that there must exist at least one code in the ensemble with
P, < P, . Hence we obtain
Theorem 5.1.1: Convolutional channel coding theorem (Viterbi [1967],
[1971]) For any discrete-input memoryless channel, there exists a time-
varying convolutional code of constraint length K, rate b/n bits per channel
symbol, and arbitrary block length, whose bit error probability P,, resulting
from maximum likelihood decoding, is bounded by (5.1.32) through (5.1.34)
where ¢ is an arbitrary positive number.
5.2 EXAMPLES: CONVOLUTIONAL CODING EXPONENTS
FOR VERY NOISY CHANNELS
As was done in Sec. 3.4 for block codes, we now evaluate the error bound expo-
nents for convolutional codes, for the class of channels for which explicit formulas
are most easily obtained. This will provide a direct comparison of the performance
of block and convolutional codes. Of course, most of the effort is involved in
computing E,(p) and C and the techniques to do this are already available from
Sec. 3.4.
For the class of very noisy channels defined by (3.4.23), we have that
E,(p) = max E,(p, q)
q
p
eee GS-7 < I 3.4.31
ae p ( )
Substituting this into (5.1.33) and using (5.1.11), we obtain R, = C/2 and hence,
for low rates
E(R) = G=R <(1— CP (5.2.1a)
2
For higher rates, substituting (3.4.31) into the second parametric equation (5.1.34)
and solving for p, we obtain
Then substituting this into (3.4.31) and in turn into the first parametric equation
(5.1.34), we obtain
R
E(R)=C —
(R)=C Limaig
(tO? SRSA ec (5.2.1b)
314 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
lim E.(R)
e>0O
Cj2
R__ Figure 5.4 lim E,(R) for very noisy channel.
e70
Ignoring for the moment the parameter ¢ < 1, we plot the composite exponent
(5.2.1a) and (5.2.1b) in Fig. 5.4 and compare it with the exponent for block codes
on very noisy channels given by (3.4.33). To obtain a meaningful comparison, we
must let (for convolutional codes)
Kb In 2
N, = Kn=—— (az)
R
For block codes, of course
K, In 2
ie a (5.2.3)
where K, is the block length in bits.°
With these definitions, the exponents of the bounds on error probability are
N,E,(R) and NE(R), for convolutional codes and block codes, respectively. We
recall from Sec. 4.6 that the relative decoding complexities per bit are
2Kn @NR
K, wary comparisons/bit (block codes)
and, as follows from a direct generalization of previous results to rate b/n convolu-
tional codes
9b oe 1 JQd(K— 1) Kb eNcR
Cae)
= ae comparisons/bit (convolutional codes)
Thus, setting N = N,, we find that, while the exponents diverge considerably, the
computational complexity is only slightly greater for convolutional codes. Clearly,
by making N, slightly smaller than N, we may achieve equal complexity, and still
maintain a convolutional exponent which is much greater than the block
exponent.
> Note that this compares encoders with the same “memory” since a convolutional code symbol is
determined by Kb information bits and a block code symbol is determined by K, information bits.
Decoder complexity grows roughly exponentially with this memory for both block and convolutional
codes.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 315
Also noteworthy is the fact that the exponent of (5.2.1) for very noisy channels
is identical to the exponent of (4.6.24) for convolutional orthogonal codes on the
AWGN channel, provided we make the obvious substitution
C nats/symbol _ C; nats/s
= 5.2.4
R nats/symbol _R;, nats/s ( )
The explanation is the same as that in Sec. 3.4 for block codes.
5.3 EXPURGATED UPPER BOUND FOR BINARY-INPUT,
OUTPUT-SYMMETRIC CHANNELS
We have thus demonstrated that the ensemble average convolutional exponent is
considerably greater then the corresponding block exponent everywhere except
for R = C and R = O. In the former case, both exponents, of course, become zero;
while at zero rate
E(0) = E(0) =Ro= E,(1)
But, for block codes, we found in Chap. 3, Sec. 3.3 that, by expurgating the
ensemble, we could obtain the much tighter upper-bound exponent®
E.,(0) = max |— )’ )) a(x)a(x’) In )) /p(y|x)p(y|x’)| (3.3.27)
q
For binary-input channels, this reduces in fact to
E.,(0) = —45 In Z > In 2 — In (1+ Z) = E,(1) (3.3.31)
where
Z => J Poly)pi(y)
y
Thus the convolutional coding exponents, obtained thus far, are weaker than
the block exponents at low rates. As already discussed in Sec. 3.10, it is not
possible to expurgate code vectors from a linear code without destroying its
linearity. With convolutional codes, not only would expurgation destroy linearity,
but it would equally damage the essential topological structure of the trellis.
However, on the class of binary-input, output-symmetric channels, we found in
Sec. 2.9 that for a linear code the error probability is always the same no
matter which code vector is transmitted. Hence, for this class of channels, we
need not expurgate, since the bound on the bit error probability for any trans-
mitted path is a bound for the entire code (independent of the path transmitted).
© For physical channels, this exponent is finite, but for degenerate channels this exponent can be
infinite.
316 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
The task then is to obtain a tighter bound on P, at low rates. Consider again
node j and the probability that a bit error occurs at this node. A decoding bit error
can occur at node j only if, for some k and some i, 0 <i < k, an error event of
length K + k began diverging at node j — i; that is, if the bit in question lies within
an unmerged span corresponding to an error event. Since the event of a bit error is
the union of such error events,’ we have the union bound on the bit error
probability at node j
jes > I1,(j — i) (5.3.1)
where we recall that I1,(j — i) is the probability of an error event caused by one of
up to 2 incorrect paths unmerged from node j — i to node j —i+k+ K. For
any parameter 0 < s < 1, we have (inequality g in App. 3A)
i2 2 : Ig(j — i) (5.3.2)
The ensemble average of P}(j) is then
P3(j) < 5 YT =i) (j — i) (5.3.3)
k=0 i=
For low rates, we may use the union-bound argument, which leads to (5.1.10),
rather than the Gallager bound, which leads to (5.1.19), to bound 7;(j — i). Thus
for a rate b/n code
‘(j — i) < (2? — 1)2"[Pr {AM(xj-:, x;-;) = O}F (5.3.4)
where x,;_; and xj_; are the correct path and an incorrect path unmerged for
K +k branches, respectively. Then, by the same steps which led to (5.1.5)
Thi(j — i) < (2° - 1)2” » qn(X)qn(x
eed >» a(x)a(x OTpol)| Oe sei
(5.3.5)
where N = n(K + k). Finally, letting p = 1/s, we obtain
Tit!*(j = 1) < (2° — yee eo (5.3.6)
where, aS was first defined in Sec. 3.3,
Elo.) = 9 9 SS aesle)|E VPOTPOTD] "3319
7 The argument used here differs from that used previously for bit error probability bounds in this
and the last chapter, which was based on the expected number of bit errors per error event. While it
leads to the same result for the ensemble average bound of Theorem 5.1.1, it leads here to a tighter
form of Theorem 5.3.1 than was previously obtained based on the earlier argument.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 317
Thus substituting (5.3.6) into (5.3.3), we have
foe) k
pile < 3 >: (2 1)2°e —n(K +k)Ex(p, q)/p
k=0 i=0
@
= 2 (k + 1)(2° — 1)2*e —n(K +k)Ex(p, q)/p
(2° — 1)2~ KbEs(o, ad/(oR)
= (1 — 27 HEx(, ad/(oR)- i})2
l<p<o (5.3.7)
where R = (b/n) In 2 nats per channel symbol.
Since for binary-input, output-symmetric channels, P, is the same for all paths
of a given code, (5.3.7) can be regarded as a bound over the ensemble of convolu-
tional codes, or equivalently, over the ensemble of generator matrices (4.1.1). Thus
from (5.3.7) we have that for at least one code in the ensemble P;/? < Pj’; and
hence for this code
Pun (Pre
_ (2° oy 1) "9 — KBE xp, q)/R 5.3.8
= (1 — 27 SEX. a)/(eR)— 1)? ( ts )
We now choose p such that
E,(p,
(l+ep= = a) e>0 (5.3.9)
Finally, maximizing over q, we obtain
Theorem 5.3.1 (Viterbi and Odenwalder [1969]) For binary-input, output-
symmetric channels, there exists a time-varying convolutional code of con-
straint length K and rate b/n bits per symbol for which the bit error
probability with maximum likelihood decoding satisfies
2. — KbEcex(R)/R (5.3.10)
2 ite: 1 EE cex(R)/[ RU + €)]
P, = i pe =
where
E.-.(R) = max E,(p,q) l<p<o
q
R= max Ex(p, @) Oc K=<
a ee l+e
(5.3.11)
E>
where we have used the fact that
max E,(1, q) = max E,(1, q) = R,
q q
318 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Actually, we can obtain the exponent explicitly in terms of the rate, since, for
binary-input, output-symmetric channels, we found in Sec. 3.3 that
14+ Z}/e
max E,(p, q) = —p In ( a (3.3.29)
q
where
Z = \/Po(y)pi(y)
y
Thus combining (5.3.11) and (3.3.29), we find
e RU ey Ls r Aa
im
and, consequently, we have that
In Z
- (5.3.12)
pin le REO — 7)
Dividing the first equation of (5.3.11) by the second, and using (5.3.12), we
obtain
Corollary 5.3.1 The exponent of (5.3.11) can alternatively be expressed as
E.x(R) (l+e.nZ
RR “the 7879 7}
Note, finally, from this that
O<R=R Att) (53.13)
1
lim E,.,(R) = —=InZ (5.3.14)
R~0 2
which is precisely the same as the zero-rate exponent (3.3.31) for block codes.
The exponent (5.3.13) is plotted in Fig. 5.5 and compared with the corres-
ponding exponent for block codes.
5.4 LOWER BOUND ON ERROR PROBABILITY
For a rate b/n trellis code, let P,(j) be the probability that any of the b information
bits associated with node j are decoded incorrectly. Certainly the average bit error
probability, P,, is lower-bounded by the smallest such node bit error probability.
Thus
P, > min P,(j) (5.4.1)
j
Assuming that path lengths are arbitrarily long (L— oo), we now proceed to
lower-bound P,(j). First note that a decoding error at node j can be caused by
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 319
many possible paths that diverge from the correct path at node or earlier. Recall
that II,(j) is the probability that a path diverging from the correct path at node j
and remerging at node j + K + k causes an error event. Since this is only one of
many possible events that can cause a decoding error at node j, we have
P,(j) = 11,0)
for any k. Maximizing over k we get
P,(j) > max II,(j) (5.4.2)
k
For arbitrary k, I1,(j) is the probability of a block decoding error with no more
than 2” code vectors each of block length (K + k)n channel symbols. Thus, this
can be regarded as a highly constrained block code of length N = (K + k)n and
rate R, = (bk In 2)/[n(K + k)] nats per channel symbol.® Hence, using (3.6.45) and
(3.6.46), we have
I1,(j) > e~ NLEsp( Rk) + o(N)]
(K + k)b In2
= exp, — P [E.,(R, 4) + 0(K)]$}| A=k/K (5.4.3)
where E,,(R, 4) = E,(p) — pE;(p) (5.4.4a)
Ree Ce (5.4.4b)
A A
with
E,(p) = max E,(p, q) (5.4.5)
Thus combining (5.4.1) through (5.4.5), we obtain
P,(j) > max 27 Kbl(l + AdEsp(R, 4) + 0(K)]/R
AO
: 2S) mie [(1 + A)Esp(R, 4) + 0(K)]/R (5.4.6)
where we assume K sufficiently large that A can be any rational number; any
inaccuracy resulting from this is compensated for by the o(K) term. To minimize
the exponent, we must take the lower envelope with respect to A of
(1 + A)E.,(R, 4), which is defined parametrically by (5.4.4). We show now that
this function is convex U, and thus we can obtain a minimum by setting the
derivative equal to zero. For, from (5.4.4a), we have
ANS MER) _ (0) — pE sip) + (1+ AN —pESo)]
= Ep) — pEe)- PS (5.4.7)
8 Since the actual number of codewords is slightly less than 2, the actual rate is slightly less than
this. But for large K these differences are negligible and will be incorporated into o(K) terms in our
bound.
320 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
since from (5.4.4b) we have
ia oe
E;(p) dj us (1 es A) (5.4.8)
Differentiating (5.4.7) and using (5.4.8) and (3.2.5), we have
d*[(1 + A)E,,(R, A)] R?
Thus we may set (5.4.7) equal to zero and obtain the absolute minimum as a
function of A. We obtain
R
Soa ros!
and combining (5.4.10) with (5.4.4b) we have
R= Pele) (5.4.11)
p
while (5.4.4a), (5.4.10), and (5.4.11) yield
min (1 + A)E,,(R, 4) = pR = E,(p) (5.4.12)
A=O0
Finally, combining (5.4.6), (5.4.11), and (5.4.12), and recognizing that the argu-
ments used assume no particular decoding algorithm, we have
Theorem 5.4.1: Convolutional coding lower bound (Viterbi [1967a]) The prob-
ability of bit error, for any convolutional code and any decoding algorithm, is
lower-bounded by
P,, > 27 KblBesp(R) + o(KI/R (5.4.13)
where
E.s,(R) =E,(p) O<p<o
Rt ce pecRbec (5.4.14)
Thus the convolutional lower-bound exponent agrees with the upper-bound
(5.1.34) for the range R, < R < C (ignoring the es), but diverges at lower rates.
This parallels exactly the situation for block codes, except that the bounds for
block codes diverge at the lower rate E{(1) < E,(1) = Ro. We note also that at
zero rate we have
E..,(0) = lim E,(p) = lim [E.(p) we pE,(p)] sii E,,(0) (5.4.15)
po po
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 321
since either the monotonic increasing function E,(p) is bounded, in which case
lim pE;(p) = 0, or it is unbounded, in which case both exponents are infinite at
7er0 rate. Thus the convolutional and block code lower-bound exponents are
equal at zero rate, and neither bound is tight.
To improve the convolutional lower-bound exponent at low rates, we utilize
the zero-rate lower bound (3.7.19) instead of the sphere-packing bound. Then, in
place of (5.4.3), we have
TI, (j) > e7 ME=x(0)+ oy
— 9 (K+K)nlEex(0) + 0(K)] (5.4.16)
Although we have used the zero-rate exponent, this result is valid for any rate,
since the exponent must decrease monotonically with R and R,. Hence (5.4.6)
becomes
P,(j) > max e~ Kt hnlEex(0) + 0(K)]
k
— o~ Kn[Eex(0) + 0(K)]
— 27 Kb{Eex(0) + o(K)]/R (5.4.17)
where E,,(0) is given by (3.3.27) (see also Sec. 5.3). We may state this result as
Corollary 5.4.1 : Low-rate lower bound For 0 < R < R, < Ro, atighter lower
bound on bit error probability than that in Theorem 5.4.1 is
P,, > 2~KMEex(0)+ (KHIR (5.4.18)
where R, is the rate at which E,,,(R,) = E,,(0).
The exponent of this bound is sketched for a typical binary-input, output-
symmetric channel in Fig. 5.5, where it is compared with the low-rate upper
bound, the latter holding only for this class of channels. We note also that we
could have used the low-rate lower bound of Sec. 3.8 (see Viterbi [1967]), but this
would have yielded exactly the same results as (5.4.17).
We comment finally on the possibility of obtaining bounds which are asymp-
totically tight for all rates. The arguments of Sec. 3.9 for block codes apply equally
for convolutional codes. If the Gilbert bound is tight [conjecture (3.9.4)], then the
resulting lower bound (3.9.5) can be used in place of (5.4.4), yielding then a
low-rate lower bound for binary-input, output-symmetric channels, which agrees
everywhere with the upper bound of (5.3.13). Thus all aspects of block code
exponents are paralleled in convolutional code exponents, which are, however,
always significantly greater in the entire range 0 < R <C.
322 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Esp (0)
Exp (R)
E.x (0)
Figure 5.5 Expurgated ensemble and
sphere-packing bounds for convo-
R___lutional and block codes on binary-
input, output-symmetric channel.
5.5 CRITICAL LENGTHS OF ERROR EVENTS*
The maximization carried out in connection with the lower bound of the preced-
ing section [(5.4.2) and (5.4.6)] suggests that certain lengths of errors (unmerged
paths) are more likely than others. Based on the lower bound, it appears that the
most likely 1 ~ k/K is given by (5.4.10). Actually, to make this result precise, we
must use a combination of upper and lower bounds. First of all, we found in
Sec. 5.1 that the ensemble average probability of an error at node j caused by an
unmerged path of length K + k is bounded by (5.1.17) for rate 1/n codes, while for
rate b/n this generalizes [see (5.1.27)]| to
TT,(j) < (2° — 1)(2°p2- K+ HE WR = <p <i (5.5.1)
We shall call this an error event of length bk, since a run of errors will occur within
k branches of b bits each, with no two errors separated by K — 1 or more
* May be omitted without loss of continuity.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 323
branches,” each with b correct bits. Rewriting (5.5.1) in terms of
N= (K+ n= (K +042]
and
we have
Il,,(j) < (2° — 1) e NlEole, a)~ Rel O0O<p<il (5.5.2)
Since the exponent is identical to that of the block coding bound (3.1.17), minimiz-
ing with respect to p and q, we obtain the equivalent of (3.2.8), namely
Hii? + pee
SF te 9 9 Mira Tales fa ca (5.5.3)
where
A=k/K
E(R, 4)= E,(p) — pE\(p) O<p<1 (5.5.4a)
waRiog: R,=Ej(p) £E,1)<R, <C (5.5.4b)
1+A
and
E,(p) = max E,(p, q)
q
Even though this is only a bound, we may expect to obtain an indication of
the most likely run length of errors by maximizing (5.5.3) with respect to k (or,
equivalently, 4). Since, other than for asymptotically unimportant terms, (5.5.3) is
the same as the right side of (5.4.6), clearly the maximization (or minimization of
the negative exponent) proceeds identically, and we obtain again (5.4.7) through
(5.4.11). Let us call the length k = AK which maximizes (5.5.3) the critical length,
Keri. Thus from (5.4.10) and (5.4.11), we have
Ke E E’
ey ee 2 a O<p<1 (555)
K_ E,p)— pE(p) —E.(p) — pE.(p)
° Note that this does not quite mean that b(K — 1) correct bits cannot occur between two incorrect
bits. For example, if b = 2 and the second bit of the first unmerged branch and the first bit of the
(K — 1)st unmerged branch are correct (with the other bit on both these branches being incorrect), the
number of correct bits between successive incorrect bits may be as large as 2(K — 2) + 2 = b(K — 1)in
this case.
324 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
We will next show that, for large K, the run length of errors tends to concentrate
around k,,;,. More precisely, we prove
Theorem 5.5.1: Error run lengths (Forney [19725], [1974]) Over the ensemble
of time-varying convolutional codes, for any ¢ > 0, the average fraction of
error events of run length k outside the interval k,,;, —«K <k <k,,;, + «K
approaches 0 as K —> oo, where
Kis pEs() | Sapch Roe kee
=) E,(p) — pE,(p) (5.5.6)
0 Us Kk < k,
PROOF (5.4.13) is a lower bound on event error probability for the best code,
but in the high-rate region, R, < R < C, it agrees asymptotically with the
ensemble average upper bound. Hence, in this region, this is an asymptotically
exact expression for the ensemble average event error probability, P,. For
lower rates, we have from (5.1.9) that
Prat De a ne ee Roe Fe (5.5.7)
But, over the same ensemble, this is also a lower bound to the average event
error probability since P, is lower-bounded by the average probability of
pairwise errors for one incorrect path unmerged for the minimum length,
which is just K branches. Averaged over the ensemble, this lower bound is the
same as (5.1.5) except for a negligible o(K) term, since that result is based on
the Bhattacharyya bound (5.1.3), which can be shown to be asymptotically
tight by the methods of Sec. 3.5. Hence
a 2 — KbiRo + o(K)I/R 2k =F.
7 — KblEd(o*) + (K)V/R Ry < R=E,(p*)/p* < C (5.5.8)
e
Combining (5.5.1) and (5.5.8), we obtain for the high-rate region
Pikes ae
P anki 2 — KbLE o(p*) + o(K)I/R
e
ey iS ih eae KbE(p)/R y YE acnadaai —p) y*
k=AK
2 — KdLE o(p*) + o(K)/R
2°—1
~ | — 2 BLE o(e/R =p)
x 2~ KbLEo(p)— Eo(p*)I/R + AEo(9)/R — p] + 0(K)} (5.5.9)
<
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 325
where, from (5.5.8), we see that p* satisfies
*
ee Esp (5.5.10)
p
and where p must satisfy the condition
E
ee (5.5.11)
The exponent coefficient in (5.5.9) can be made positive for A large enough.
We next examine the critical value of A where the exponent is zero in the limit
as p — p*. The critical value of / satisfies
E,(p) es E,(p*) | if Pah % 0 6 (5.5.12)
or
Eig) - EM)
A= R= EG (5.5.13)
Using (5.5.10), we have
, — P*LE(e) — E.(e*)|
pE,(p*) ned p*E,(p)
Oe ed) PVP)
~ E.(o) — elE.() — Eo*)Vo = 9*) aan
and
ae p*[E.(p) — E.(0*)\/(p — p*)
don = Tim (0) — olEslo) — E,lo* V0 — 0°)
i eae
~ _E,(p*) — p*E,(*) ee
which is exactly (5.5.5). Hence by choosing 4 = 4,,;, + 6 we have
a eee a aie Dr, (5.5.16)
Noting that k,,;, maximizes the bound on II,(j), we can similarly show that
P ears
fin Pee en SY (5.5.17)
K->« rr
which completes the proof in the high-rate region.
326 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
In the low-rate region, we have from (5.5.1) with p = 1 and from (5.5.8)
fo 6)
(2° es 1)27 KeRIR . (2- ARR 11)
Pr {k ae eK} < k=eK
P. 4 2 — KbiRo + o(K)I/R
2 =A — Kble(Ro/R— 1)+0(K
= 1 = RIR=1) 2 €(R o/ )+ 0(K)] (5.5.18)
Hence also
.« Pik See
im EES oo ke, (5.5.19)
Ko e
and we have shown that the fraction of error events with lengths which
deviate from k,,;, of (5.5.6) by eK approaches zero as K > oo for any € > 0.
This proves the theorem.
Figure 5.6 shows the ratio A,,;, = k.,i,/K as a function of R for a typical
memoryless channel. For the class of very noisy channels (see Sec. 5.2), we can, in
fact, obtain an exact expression, since in this case E,(p) = pC/(1+ p) and
p = (C/R) — 1 for C/2 < R <C, so that
0 Oa kK < Ce
mi BS 1 (5.5.20)
Cu ee
cnet o>
Thus, for asymptotically large constraint lengths, the “ most likely ” error length is
very small for R < Ro, increases stepwise at R,, and grows without bound as
R-C. For very noisy channels, the step increase at Ry = C/2 is equal to one
constraint length.
k crit / K
R Figure 5.6 Normalized critical length of
error runs.
Q --—-----------_"--
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 327
5.6 PATH MEMORY TRUNCATION AND INITIAL
SYNCHRONIZATION ERRORS
In Sec. 4.7, we indicated that practical storage constraints require limiting the
path memory for each state to a finite length, usually a few constraint lengths. One
way to truncate memory at t branches is to make a maximum likelihood decision
among all paths which are not merged t branches back. It easily follows that a
truncation error can occur only if an incorrect path which diverges from the
correct path at the jth node, and remains unmerged from it for t branches, has
higher metric than the correct path after t branches. For, if the paths merged
before t + 1 branches, the path with higher likelihood would survive, whether or
not truncation were employed. Thus, consider the set 2(j; t) of paths which
diverge from the correct path at node j and remain unmerged for exactly t
branches. Now there are no more than 2” such paths. Thus, by exactly the same
argument used in Sec. 5.1, analogous’® to (5.1.17) but for b > 1, we find that the
ensemble average probability that an incorrect path has higher metric than the
correct path after t unmerged branches is bounded by
P,(j) < 2be~ mE ole. a) — 2—belEdlo, a)— pRIIR O0<p<il (5.6.1)
Thus, maximizing with respect to p and q, we obtain the usual ensemble error
upper bound for block codes of block length b(In 2)t/R.
a Site tee (5.6.2)
where
E(R)=Ro—R 0<R<E(1) (5.6.3)
and for the high-rate region
E(R) = E,(p)— pE(p) O<p<1
R= EX(p) E\(1)<R<C (5.6.4)
Comparing (5.6.2) with (5.1.32) we may conclude that truncation errors will not
significantly (exponentially) affect the overall error probability if the truncation
length t is such that
tE(R) > KE(R) (5.6.5)
where E(R) of (5.6.3) and (5.6.4) is the block coding exponent, and E,(R) is the
convolutional coding exponent of (5.1.33) and (5.1.34).
10 This is just the block coding error bound for a code of nt symbols and 2” codewords.
328 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
For very noisy channels, condition (5.6.5) reduces to
1
1
WK>{s ep C4 RS CPR (5.6.6)
1+./R/C
1— JRIC C/2<K<€
Note that, at R = C/2 = Ro, this indicates that the truncation length for very
noisy channels should be t> K/(,/2 — 1)? +5.8K. In practice, truncation
lengths of 4 to 5 constraint lengths have been found sufficient to ensure minor to
negligible degradation.
Another problem arising in a practical decoder is that of initial synchroniza-
tion at any node other than the initial node. As we indicated in Sec. 4.7, synchro-
nization eventually occurs automatically once the initial symbol of each branch has
been determined (which we assume here has already occurred). However, during
the early stages of synchronization, many errors may occur. The situation in
starting in midstream is that no initial state metrics are known. Thus we may take
them all to be zero. In decoding in the usual way, we may regard as an initial
synchronization error, any error which is caused by a path which is initially un-
merged with the correct path, for an error caused by any path initially merged
would have occurred anyway. Now s branches after decoding begins (in mid-
stream with all metrics set to zero at the outset), there is a set of at most 2°
initially unmerged paths, which are merging for the first time with the correct
path. Clearly, this set is the dual (and the mirror image) of the set #(j; t) con-
sidered above in connection with truncation errors. Thus the probability of initial
synchronization error decreases exponentially with s, the number of branches
after initiation of decoding. In fact, the ensemble average upper bound on initial
synchronization error is the same as (5.6.2) with t replaced by s. Thus after s
branches, where
sE(R) > KE,(R) (5.6.7)
the effects of initial synchronization on error probability become insignificant. In
practice, the first s branches (s ~ 5K) are usually discarded as unreliable, when the
decoder is started in midstream.
5.7 ERROR BOUNDS FOR SYSTEMATIC CONVOLUTIONAL
CODES
In Sec. 2.10, we showed that every linear block code is equivalent in performance
to a systematic linear block code, and in Sec. 3.10 we showed that the best linear
code, and hence the best systematic linear block code, performs as well asymptoti-
cally as the best block code with the same parameters. That this is not the case for
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 329
systematic convolutional codes was intimated in Sec. 4.5 where we found that, in
general, the best systematic codes have smaller free distance than the best nonsys-
tematic codes.
We now proceed to obtain a more precise measure of the performance loss of
systematic convolutional codes by deriving upper and lower bounds. We recall
from Sec. 4.5 that a systematic rate b/n convolutional code is one in which, for
each branch, the b data symbols’ are transmitted uncoded, followed by n— b
parity symbols, which are generated just as for nonsystematic codes and con-
sequently depend on the last Kb data symbols. The systematic constraint affects
primarily the form of the code paths during remerging, for any incorrect path
remerges with the correct path only when (K — 1)b consecutive data symbols are
identical to those of the correct path. But when this occurs, exactly this many of its
code symbols are identical to the code symbols of the correct path (the first b
symbols of each of the K — 1 branches just before remerging). Hence, the effective
length of the unmerged code paths is reduced by (K — 1)b code symbols, since
identical code symbols are useless in discriminating between code paths.
We first determine the effect of this property on the upper bound of Sec. 5.1.
The bound (5.1.27) applies in the same way, but now the effective length of
incorrect code paths unmerged for (K + k) branches is only
N’ =(K+k)n— b(K —1)=K(n—b) +kn+b (5.7.1)
rather than (K + k)n, for the kth term of the summation. Note, however, that over
the first (k + 1) branches all possible data symbols are used; hence the ensemble is
not curtailed. Another viewpoint is that the kth term of (5.1.27) is an ensemble
average upper bound for a block code of 2°“* 1) code vectors of length n(K + k);
we showed in Sec. 3.10, based on Sec. 2.10, that the ensemble average upper
bound for systematic block codes is the same as for nonsystematic block codes.
Hence we may employ this result, but the “block code” resulting from consider-
ing (2° — 1)2°* incorrect paths unmerged for K + k branches has only N’ rather
than (K + k)n effective code symbols. Thus substituting N’ of (5.7.1) in place of
(K + k)n in the kth term of (5.1.27), we obtain
II,(j) < [(2° — 1)2°*}° ge [K(n— b) + kn + bJEo(p, a)
o (2° <; 1? plied teterenagel Aas oc aatl q)/R-p} 0< p< 1 (5.7.2)
where we have again used R = b In 2/n and r=b/n= R/In 2. Thus inserting
(5.7.2) for TI,(j) in (5.1.29), we obtain, in place of (5.1.29)
ee a i ee ee
E[n,(j)] < [1 — 27 SdE ote, ayRI— 93) O<psl
'! Tf the channel input is not binary but Q-ary, then | = vn (where v = [log Q1is the least integer not
less than log Q). Each sequence of vb input bits is transmitted, after mapping, as b Q-ary symbols
followed by | — vb coded bits mapped into (n — b) Q-ary symbols (see Fig. 5.1).
330 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Proceeding with the remainder of the steps in Sec. 5.1, we find that there exists a
systematic convolutional code whose bit error probability is bounded by
2 KbE,(R)(1—r)/R
P, < (2? — iy —D=bE ARR? (5.7.3)
where E,(R) is given by (5.1.33) and (5.1.34).
We now turn to the lower bound, modifying the derivation in Sec. 5.4 in the
same way. Here again b(K — 1) code symbols of remerging incorrect paths are
constrained to be the same as those of the correct path. Hence, in (5.4.3),
N =(K + k)n must be replaced by N’ of (5.7.1). This yields, in place of (5.4.3)
through (5.4.5)
II,(j) > te N 'LEsp( Rx) + o(N)]
ace e N’LEsp( Rx) + 0(K)]
= e- [K(n— b) +kn + b][E,p( Rx) + 0(K)]
> 2 ~ Kb. —r + A/RILEsp(R, A)+ 0(K)] (5.7.4)
where
k
d=z
b
i= -
n
E,,(R, 4) = E,(p) — pE,(p) (5.7.5a)
l+A-r.
R=—> :
= ee E‘(p) (5.7.5)
Then proceeding as in the remainder of Sec. 5.4, we have
P,(j)= 2~Ko[ min (1+ A—r)Esp(R, 4) + ofK)]/R
— 2—Kb[Ecsp(R\(1 — 1) + o(K)/R (5.76)
where
E...(R)=E(p) V<p<o
_ Flo)
p
R G<k=C Sl)
Thus the upper-bound and lower-bound exponents agree for Rg < R < C. While
we cannot, in general, obtain tight bounds for lower rates, we can improve the
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 331
lower bound by using the zero-rate lower bound (3.7.19) in place of (5.7.5) with the
result
P,(j) > 27 KMExOL +R = QE R<R, < Ry (5.7.8)
We summarize all these results as
Theorem 5.7.1: Systematic convolutional code bounds (Bucher and Heller
[1970]) For systematic convolutional codes, all the upper and lower error
bounds of nonsystematic codes hold with all numerator exponents multiplied
by
b R
Note that there is a severe loss when b/n is close to unity. Even for b/n = 4, the
reduction in exponent requires doubling the constraint length to obtain with
systematic codes the same asymptotic results as for nonsystematic codes.
5.8 TIME-VARYING CONVOLUTIONAL CODES ON
INTERSYMBOL INTERFERENCE CHANNELS*
We conclude this chapter with an application of the ensemble average error
probability analysis to the class of time-varying convolutional codes with the
intersymbol interference (ISI) channel, first defined and analyzed in Secs. 4.9 and
4.10. Figure 5.7a and 5.7b illustrates the analog model and digital equivalent of the
intersymbol interference channel, which are the same as in Figs. 4.20 and 4.21 but
with a rate b/n convolutional encoder preceding the channel. In Sec. 4.10, we have
shown that the maximum likelihood combined demodulator—decoder can be
realized with a Viterbi algorithm of dimensionality [(¥Y — 1)/n] + (K — 1) where
the trellis diagram comes from combining the convolutional encoder and ISI
linear filter into a single device. Here we shall assume such a maximum likelihood
demodulator—decoder.
In the trellis diagram for the combined demodulator—decoder, a path that
diverges from the correct path and later remerges for the first time can cause an
error event only if it accumulates a higher metric than the correct path while
unmerged. Such a path can correspond to a data sequence with a path in the
convolutional code trellis diagram which diverges and remerges with the correct
path more than once during the same span of branches over which it is totally
unmerged in the coded ISI trellis. We shall first consider only those paths for
which there is only one unmerged span in the code trellis corresponding to the
* May be omitted without loss of continuity.
332 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
n (ft)
Convolutional
u d x
enco a ee Impulse ES A(r) h(-1) ee eee
. eee generator
n
(a) Analog model
;}—_—_——_»>
Linger ones
logic etch | a ai’ |
(correlated
Gaussian noise)
(b) Digital equivalent
Figure 5.7 Coded ISI channel model.
unmerged span of the coded ISI trellis. That is, we first limit our discussion to
error events for which the unmerged span in the coded ISI trellis corresponds to
paths in the convolutional code trellis which diverge and remerge only once.
Let {x,} be the channel symbols (+1 or —1) of the correct path and let {x;}
be the channel symbols corresponding to a path that diverges from the correct
path and remerges for the first time after a span of N channel symbols. Here
=x n<0n>N+1 (5.8.1)
Defining ¢, = 4(x, — x,,),n = 1, 2,..., N we have from (4.9.18)
Poe ras \- y:(toe . 2, iet-s]| (5.8.2)
n=1 o =1
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 333
where the subscript E, indicates the restriction to an error event with paths that
diverge and remerge only once in the convolutional code trellis during the span of
N channel symbols. Suppose we could average P,,(€) over all sequences € = (€,,
€>,.---, €y) using the product measure
N
Gv(e) = [] (en) (5.8.3)
n=1
where
" ; =i, —1
a0 = {i (5.8.4)
2 c=
or, equivalently
ai (5.8.5)
Averaging P, (€) over this ensemble yields
Ws.) 2. YD [aGa) ex
€1 €2
P,(N) (5.8.6)
= (toe +20 he rs |
Note that this expression differs from (4.9.22) in that here the summation is over
all sequences and there is an additional weighting of 1/2%. It remains, of
course, to justify the validity of this weighting, as we now do by the following
argument.
Figure 5.8 illustrates the generation of the terms inside the product in (5.8.6)
for the two paths of N channel symbols which correspond to an incorrect path
that diverges and remerges with the correct path in the code trellis diagram. Its
right half resembles Figs. 4.22c and 4.23c (the uncoded cases), but the error se-
quence now depends on the code. The error sequence for a particular pair of
(correct and incorrect) information sequences u and w’ are generated as shown in
the left half of Fig. 5.8. The information sequence u is encoded by the convolu-
tional coder into the channel sequence x. The binary sequence is mapped into the
real channel inputs according to the convention “0” > +1 and “1” — —1. Since
x, = +1, the error sequence term is given by
X, — X;
& = 3X — X%) = ee (5.8.7)
Because of the linearity of the convolutional code, we may form the vector
/
d=3|x—x’|=G|x, — x1 |, 4]x2 — x3,
.. |X — Xn)
by first forming the modulo-2 sum of the binary information sequences v = u@ wu’,
encoding this sum using a convolutional encoder identical to that which encodes
u, and mapping the resulting binary sequence according to the convention
“0” +0 and “1” -— +1. The error sequence is then obtained, as determined by
(5.8.7), by multiplying this sequence by the coded information sequence. This
explains the form of the error sequence generator shown in the left half of Fig. 5.8.
‘ Ky Sew
—
(Ys *¥a)f
I0}e19U903
BUTLYSION
Fars
10} ¥19U93 OLIJOUI YOueIg
‘(oSvIOAv B[QUIDSUD) 1O}BIOUS SIIJOW YOULIG WISLIP 93¥)]S-IO1IG Q'S BNSIy
Jo}e19Ua8 sdUENbes IOLA
ee
eae ae |
0 ~<~0O
P “
sIapoous
x [eorjuep]
‘
I-~ |
I+ < 0
eT oe ee |
Ee Ss
o
=
=
334
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 335
Over the ensemble of time-varying convolutional codes, each component of
the vector d is equally likely to be 0 or 1 provided u and ware on unmerged paths
(or, equivalently, v has diverged from the all-zeros path). The bit error probability
is averaged not only over the code ensemble but over the data sequence u as well.
Since v varies over all binary sequences independent of u, the sequence x is
independent of the sequence d even though the two generators shown are identi-
cal. Each component of x is equally likely to be a + 1 or —1. Hence each compo-
nent of the error sequence € is 0 with probability 1/2 and +1 or —1 each with
probability 1/4, which verifies the weighting of (5.8.4).
The branch metric generator half of Fig. 5.8 is similar to that of Figs. 4.22c
and 4.23c except that here the weighting has an additional 5 factor to account for
the code ensemble averaging. We now present a straightforward matrix version of
the convolutional coded bit error bound discussed in Sec. 5.1 as modified for the
ISI channel.
Define the state sequence, which corresponds to the contents of the last Y — 1
stages of the branch metric generator
le po js tees Gp cece Par Aaeneny He or | (5.8.8)
and the shift relationship
ey i a ess Se (5.8.9)
Let
Ag =©@ Ay B,, :.-, Age-1_,
be the 3*~' possible distinct states. Initially we must have s, = Ap = 0 since,
before unmerging, the error sequence is 0. Also define
1 a gia |
TG: S,,) T GlEn) exp oe (he + 2 oy hitaty-«}] (5.8.10)
o i=1
and define the 3%~! x 37! matrix
where
ae f(.A;)) ifA,;= al<. A;) for some € € {—1, 0, 1} (5.8.12)
0 otherwise
Then (5.8.6) becomes
N
P;,(N) < [] F(a, s,)
ef}. £2 EN n=
1
wi
=[11--- 1A | (5.8.13)
0
336 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
The matrix A is the state transition matrix of the intersymbol interference. It
has only three nonzero components in each row and each column, where the
nonzero components are branch values of a state diagram whose generation is
shown in Fig. 5.8. For # = 2, for example, it is the state transition matrix for
Fig. 4.22d but with the “0” state included with a self-loop and all branches also
weighted by the probability g(c). Hence
£0,0) f(,-1) £0, 1) Ay = 0
A= | f(—1, 0) 4, 21 pea prey Ay oa
ALO FG, 3° FH A, =
where
f(0, 0) =f(0, —1) =f (0, 1) =3
f(-1,0)=f(1,0) = 4a
f(-1,-1)=f(, 1) =4a,
f(-1,1)=f(L, —1) = 4a,
and dp, a;, az are given in Fig. 5.9(a) which presents the state diagram for this
case. Note that the bound in (5.8.13) represents the set of all paths of length N
starting from the initial state s; = 0. It can terminate in any state, however, since
merger of the code path guarantees only that ¢,=0, but the state-vector
S, = (€:-(¢-1)> +++» &n-2 €n—1) the contents of the register of Fig. 5.8, is arbitrary.
This also explains the fact that the premultiplying vector in (5.8.13) is (1 1 1--- 1).
; (a, + a)
1
4"2
+t. oe,
1 1
2 40 2
4 70 4 40
1 a, = e- o+2hy/No 1
ay = &o-2hy)/No 2
(a) State diagram (b) Reduced state diagram
Figure 5.9 Coded ISI channel error state diagram (ensemble average) for # = 2.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 337
By symmetry, the set of all paths ending at state “ — 1” is the same as the set of
all paths ending at state “1.” Hence, we have for # = 2 (for Y = 3 see Prob. 5.11)
P,,(N) <[1 1]A% 4
where
z_[ £09 £041)] A=0
sVUELG GLa] A= 4!
where
f(0, 0) =f, +1)=2
f(+1, 0) = 4a
f(+1, +1) =4(a, + a.)
The corresponding reduced state diagram is shown in Fig. 5.9b.
In general for memory Y, the 3%~ 1 x 3%~! matrix A corresponds to a state
diagram where the 3*~' — 1 nonzero states come in equivalent pairs, for which
the set of N-step transitions to these states starting at the zero state are the same.
Hence we can always find a reduced state diagram and the corresponding square
matrix A of size (3%~ 1 — 1)/2 + 1 such that (5.8.13) is expressed as
— 4
1
ur
P,,(N) ee 0 cae S/o ee (5.8.14)
0 |
Thus, in the following, the matrix A can be used interchangeably with A, with
concurrent reduction of the dimensionality of the vectors. Initially, however, for
clarity of exposition we shall consider the unreduced diagram; the reduction will
then follow immediately.
Recall that (5.8.13) is the convolutional code ensemble bound on the probabil-
ity that a path diverging from the correct path and remerging N channel symbols
later (in the convolutional code trellis) causes an error event. If the span over
which the two paths are apart is K + k branches, then N = n(K + k). The code-
ensemble average bit error bound due to these single code-merger error events is
then (see Sec. 5.1)
~ 5b (k + 1)2*P, (5.8.15)
where P, = P;,(N) with N = — + k). Substituting (5.8.13) into (5.8.15), we see
that P,, is bounded by
1
0
oY ated 0 (5.8.16)
0.
saree
| thy
338 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
The matrix A is nonnegative and irreducible. The Perron-Frobenius theorem
(see, e.g., Gantmacher [1959]) states that such\a matrix has a real maximum
eigenvalue A and an associated positive left eigenvector. Defining « > 0 to be the
largest component of the left eigenvector divided by its smallest component we
have (Prob. 5.12) the inequalities
e3
i 0
—AN <[1 1 ++ 1A") | < adh (5.8.17)
0.
Thus (5.8.16) can be expressed as
——) @2 +}
P Os sri 8.
Sq ame? 2r4n< (5.8.18)
Up to this point, we have restricted the error events to those paths that
diverge and remerge only once in the convolutional code trellis during the un-
merged span in the coded ISI trellis. Now consider again the transmitted convolu-
tional coded sequence {x,} and another coded sequence {x;, corresponding to an
error event satisfying (5.8.1), but suppose that the paths merge twice in the convo-
lutional code trellis, merging at N, but diverging again at N, where
nK <N,<N,<N,+(£-1) (5.8.19)
This means that the code paths diverge again before the e€ register of Fig. 5.8 is
allowed to clear, for that would require N, > N, + (¥ — 1). We thus have, in
addition to (5.8.1)
pe n=N,+1,N,+2,..:, Nz (5.8.20)
This situation is sketched in Fig. 5.10 where the paths in the convolutional code
trellis merge at N, and N. The error sequence for the N coded symbols is thus
ee Saeco ree | Wey |, Se ema Fe (5.8.21)
Over the ensemble of time-varying convolutional codes with product measure
given by (5.8.3), € has measure
Ni N
qe) = [] en) [] ae) (5.8.22)
n=1 k=N2+1
1 N, Ny N N+(£-1)
— T a Pigy
Figure 5.10 Typical two code-merger path in the coded ISI trellis.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 339
For this error sequence, (5.8.2) becomes
Py,(€) < rH exp \- x (toe + 2 htate-i}|
n=1 a)
N f 1 ee
~ 411: ete ‘s x (lod +2) hvets-.)| (5.8.23)
k=N2+1 o i=1
and its average over the ensemble is
P(N NN ey [sla DY i Fleas8))
€4 én, n=1 €N2+1 €n k=N2+1
so 2 rH rt Eq Sq (1 1--> 1JA%—%2i(sy,41)) (5.8.24)
én, n=1
where i(Sy, + ;) is the Be ')-dimensional column vector with “1” in the position
corresponding to state sy,,,; and “0” elsewhere.'* An inequality similar to
(5.8.17) also applies here (Prob. 5.12) to give
[1 1-+- 1]A%~%2i(s)y 44) < ad? (5.8.25)
This bound eliminates the dependence on state sy, , , and allows separation of the
two code-trellis spans that make up the single-error event in the coded ISI trellis.
Thus (5.8.13), (5.8.17), and (5.8.25) yield the further bound on (5.8.24)
P,,(Ny, N — N2) < (adN*)(ad% 2) (5.8.26)
For fixed N,, N,, and N given above, the number of paths that merge twice in
the convolutional code trellis is bounded by (2° — 1)2*1(2° — 1)2°*, where
n(K + k,) = N, and n(K + k,) = N — N,. For such error events, there can be at
most b[(k, + 1) + (k, + 1)] coded binary symbol errors. Since
(k, + 1) + (kK, + 1) < 2(k, + 1)(k, + 1) (5.8.27)
(see Prob. 5.13 for generalizations to | code mergers), the code ensemble average
bit error — due to these two code-merger error events is bounded by
r= as DY y Mh Ak, +. 192" — 12™(2" — 1
oes 0 k2=0
x aan aa)
r ore) 2
= 2Io xy e+ neury |
§ =0
ce Bae Le
= " 5.8.28
"la — 2am’ ( )
12 This follows since here the initial state is not € = 0 but rather the € corresponding to the contents
of the register when the code paths diverge for the second time.
340 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
The bounds for two code-merger error events easily generalize to error events
where there are / path mergers in the convolutional code trellis during the single
unmerged span in the coded ISI trellis. For any integer |, the corresponding
code-ensemble average bit error bound due to these events is
a(2’ — 1)
(1 — 2°A")
Taking the union of events bound over all error events, we find that the code-
ensemble average bit error is bounded by
pits: l
ms J : | poi, 2. (5.8.29)
a(2° — ue Ey gnk (5.8.30)
From this we obtain
Theorem 5.8.1 For an additive Gaussian ISI channel with # nonzero
coefficients hy, h,, ..., hy—,, there exists a time-varying convolutional code of
constraint length K and rate b/n for which the bit error probability with
maximum likelihood demodulation—decoding is bounded by
y(R)27 KeRolR
Po Sip (Ry REP (5.8.31)
where ‘
»(R) ee (1 — D2
R=" In2<Re
Ro=-—Ind___nats/channel symbol (5.8.32)
and where / is the maximum eigenvalue of the ISI channel transition matrix
A, and « is the ratio of the maximum component over the minimum compo-
nent of the positive left eigenvector associated with A.
The maximum eigenvalue / and the ratio of eigenvector components « are
the same for both the state transition matrix A and the corresponding reduced-
state transition matrix A (Prob. 5.12). In the case of duobinary ISI where hp = 6,
and h, = &,/2, we have the maximum eigenvalue
ay —1\?
i+ fi (=) |
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 341
A
Duobinary JS/
0.7
Without /S/
0.6 -
0.5 | | &,/No, dB
—j 0 1 3 4 5 6 7 8
Figure 5.11 Maximum eigenvalue for duobinary ISI.
and ratio
(24 — 1)/ao
R
I
where
Ao = e &slNo
Figure 5.11 shows 4 as a function of &,/N, for this special case, as well as for
the non-ISI AWGN channel (where hy) = &, and h, = 0) for which the only
nonzero eigenvalue is (1 + a,)/2. It is interesting to note that rate = 4 encoding
together with duobinary digital linear filtering results in no net change in the
signal spectrum; yet the performance loss relative to rate = 4 coding only, as shown
by Fig. 5.11 and (5.8.32), is less than 1 dB. Of course, there are now three
signal levels rather than two.
5.9 BIBLIOGRAPHICAL NOTES AND REFERENCES
The basic upper and lower bounds on convolutional codes in Secs. 5.1, 5.2, and 5.4
first appeared in Viterbi [1967a]. The expurgated bound for binary-input, output-
symmetric channels in Sec. 5.3 appeared in slightly weaker form in Viterbi and
Odenwalder [1969]. The results on critical lengths of error events and memory
truncation and initial synchronization errors in Secs. 5.5 and 5.6 are due to
Forney [1974]. The modification of the results of the first four sections for
systematic convolutional codes, treated in Sec. 5.7, is due to Bucher and Heller
[1970]. Application of the ensemble average error probability techniques to the
intersymbol interference channel with coding has not been published previously.
342 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
PROBLEMS
5.1 Find E{R) and E,,,(R) for the following channels and compare with the corresponding block
exponents
(a) All channels of Prob. 3.2.
(b) All channels of Prob. 3.3.
(c) Channel of Prob. 3.5.
(d) Channel of Prob. 3.6.
5.2 Verify (5.6.6) and plot k,,;,/K versus R forO < R<C.
5.3 (Construction ofa Block Code from a Convolutional Code by Termination with a Zero Tail, Forney
[1974]) Suppose we construct a block code of length (L + K — 1)n = N,symbols by taking L branches
of a rate b/n convolutional code and terminating it with a (K — 1) branch tail of all-zero data.
(a) Show that the rate of this block code is given by R, = [L/(L + K — 1)]R where R = (b/n) In 2
is the convolutional code rate in nats per symbol.
(b) Show that the block error probability of this block code is upper-bounded by
L2>—1) ;
32 1 — 2~ SLE0()/R - 0) po 0<p<il
(c) Letting
L
hy greet 0<6@<1
L+K-—1
show that
Peter ee
where
R E eu
E(R) = E,(p) ri é Rx E,(p)(1 — €)
p
and where ¥# is a constant independent of K, for « > 0.
(d) Now, since L and K are arbitrary, choose 8 so as to minimize the bound on P,,. Show then
that
P< Ke NoElRo)
where
E,(R,)= max {(1 —@)E(R)}
6, R: Rp=RO
(e) Substituting the result of (c) into that of (d) show
R
E,(R,) = max Es) a4 : 0
O<p<l1 —
Thus, aside from « < 1 and the constant .#, we have constructed a block code which is as good as
the ensemble average upper bound on block codes (Chap. 3).
5.4 In Prob. 5.3, suppose that after step (b), we arbitrarily choose L = k,,;, of (5.5.8)—the critical run
length of errors. Then show
hers:
pu R= Ep) Paps 4
Rog - K
K
E,(R,) ® oe x Pole) E,(p) — pE{(p)
and thus obtain the same result as in 5.3(e).
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 343
5.5 (Lower Bound by a Termination with a Zero Tail Argument: Alternative Proof to Theorem 5.4.1,
Forney [1974])
(a) Consider a terminated convolutional code as in Prob. 5.3. Show that this block code must
have block error probability
Py > 27 OK (Esp(Re) + (KAR — 0)
where
N,=(L+K-—1)n
L
R, = —————
L+k-1
(b) Applying the definitions of Prob. 5.3(c) show that
Pp > 27 KE sp(Re) + (KAR (1 — 8)
where
E.,(Ry) = E,(p) — pE(p) 0<p<o
R,= E(p)— RO --0<6<1
(c) Now show that the probability of at least one error in L branches of any convolutional code is
lower-bounded by
— bK[E¢sp(R) + 0(K)]/R
| a :
where
Eag(R) = ary
ml |
6, R:R,=RO i-@
et AR Pea)
= min ;
0<p<o@ 1 ne E‘(p)/R
and this minimization yields
E...(R)=E,(p) where R=
E
Ele) 0<p<o
p
5.6 (Upper Bound on Free Distance by a Termination with a Zero Tail Argument, Forney [1974])
(a) Show that, for a terminated convolutional code with parameter as in Prob. 5.3
dnin(Dlock) > d,,..(convolutional)
(b) Thus, given any upper bound for the block code
dmin(Dlock) < D(R,)
show that [using the definition of 5.3(c)]
sree dinin D(R,)
(ik NG 0) TWX 8)
where R, = RO
and hence
D(R@
a os min ( )
(K—I)n” o<gc; N,(1 —9)
344 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
(c) Using the Plotkin bound D/N, ~ 3(1 — R,/In 2), show that (b) merely results in
d
free 1
(K-12
(Note that this agrees asymptotically with the result of Prob. 4.9. A tighter, more useful bound can be
obtained from the Elias upper bound with considerably more manipulation.)
5.7 (Gilbert-Type Lower Bound on Free Distance)
(a) Suppose d;,.. < 6(R) for every binary convolutional code. Show that, on a binary-input
output-symmetric channel, any such code yields
P,= » J Poly)Pi(y)
[6(R)— 0(K)]
~ ZAR)
(b) Suppose that
<
Kn ~ In (2e7% — 1)
Show that this would imply that, for every binary convolutional code used on a binary-input,
output-symmetric channel
as KbE.-,(R)/R
Pee
where
E..,(R) In Z
Rin Qe-® —1)
(c) Show that this is in direct contradiction to the upper bound of Theorem 5.3.1 and Corollary
5.3.1, and that hence there exists a convolutional code for which
6(R) eR
spent
Kn ~ In (2e-® — 1)
(d) Suppose we terminate this code in exactly the same manner as Probs. 5.3, 5.5 and 5.6. Show
that there exists a resulting block code with
dmin = Op(R)
where 6, satisfies
R=In2— #(6,) (Gilbert bound)
5.8 Consider an L-branch convolutional code of rate b/n. Show that, over some ensemble of convolu-
tional codes, the average node error probability for any node j is bounded by
P.(j) < L 27 KlEo() + o(K))/R
e =
where p satisfies
when E,(1) < R < C, and p = 1 for R < E,(1).
5.9 Prove (5.5.17) of Theorem 5.5.1.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 345
5.10 Consider a K = 3,r = 5 time-varying convolutional encoder where at time i when the binary data
symbol u; enters the encoder the output binary symbols are v; = (v;,, v;2) as shown below, and g, g,
(i)
81 :
ys
. : wa
> V;
of 2 Figure P5.10
and g$) are the time-varying connection vectors of dimension 2. Assuming the all-zero data sequence is
transmitted, we can consider the modified state diagram showing distances of all branches from the
all-zero path branch at time i as
vy
=x
pe
p37
where, for k = 1, 2, 3, ..., 7,
Zh) = v4 + 0,2
is the sum of the encoder output binary symbols for the kth branch of the state diagram at time i.
(a) Define, for the above state diagram, €,(D, J; j, i) = transition function for all paths going
from state a to state b at time j and going to state x at exactly time i, where x = b, ¢, d.
Let
E,(D, ee Fe i)
&(D, 1; j, i) = | €.(D, I; j, i)
E4(D, I; j, i)
and find A(i + 1) such that
&(D, 1; j, i+ 1) = A(i + 1)E(D, I; j, i)
zi D7
E(D, 1; j, j)= | 0 |
0
Initially we have
346 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
(b) For a binary-input, output-symmetric channel, show that the node error probability at node j
is bounded by
Pj) < DEAD, I; j, i)
i=j l=1,D=Z
(c) Suppose at each time i the time-varying connection vectors are independently selected at
random according to
P(g), ai”, 8%’) = (2)°
for all connections g*), g‘, and g{). Over this ensemble of time-varying codes, show that
P(j) < T(D, I)\ro1, n= +2212
and the averaged bit error probability is bounded by
<=... OF; Fj
t Pa
él I=1, D=[((1+Z)/2]2
where
D?I
T(D, I) =
1— D(1 + D)I
(d) Generalize (c) to arbitrary K and rate r = 1/n where
DX(1 — D)I
T(D, 1) = —
1 — D[i + 1(1 — DX-?)}
and
ie (- “| - 2 ~ (Ro/R)
This gives a bound which is exponentially the same as those of (5.1.23) for rates
R < Ry = —In[(1 + Z)/2].
Hint: See T(L, 1) given by (4.6.5).
5.11 Show that, for Y = 3, the 9 x 9 matrix A defined in (5.8.11) reduces to the 5 x 5 matrix
ae ea ates 4 J Ao =(0,0)
a 8-0 6 teres A, =(0, +1)
A=1|]0 ta, 4a, 4a, 0 A, =(+1, +1)
0 45a, 3a, 345 0 A3 = (+1, #1)
Gi eae eee. eee 0 S A, = (+1, 0)
where do, @,, 4), 43, G4, 45, 4g, a7, and a, are defined by Fig. 4.23. Here the state at time n is
S, = (€,-25 €,-1). Also sketch the reduced state diagram for this case and show that (5.8.13) becomes
7
P,(N) <[1111 14%
10]
5.12 Prove the inequalities (5.8.17) and (5.8.25) and show that 4 and a are the same for the state
transition matrix A and the reduced state transition matrix A. Note that a is the ratio of the largest
component to the smallest component of the positive left eigenvector of A associated with the maxi-
mum eigenvalue A.
CONVOLUTIONAL CODE ENSEMBLE PERFORMANCE 347
5.13 For nonnegative integers k,, k,, ..., k,;, prove the inequality
Lk+ )<IT] (k; + 1)
This general form of (5.8.27) is required to prove the code-ensemble average bit error bound for /
code-merger events given in (5.8.29).
5.14 Generalize the results of Sec. 5.8 to channels with an arbitrary but known finite memory part
followed by a noisy memoryless part where, for channel input sequence x = (x,, X,, ..., Xy), the
channel output sequence y = (y,, y>, --., yy) has conditional probability
N
Py(y |x, S;) ae ry P(Yn|Xn> Xn-1s+++5 Nenitgen ays
n=1
where s; = (X2_y, ..., X-4, Xo). This is a channel with memory ¥. Defining the state sequence
S, ae + are eae 2 Xn-2> , ee
the channel conditional probability becomes
N
Pry |x, S;) aay ba P(Yn|Xn; S,)
n=1
where there is a state transition equation
memory L£ memoryless
\
|
|
iy £€
Noiseless Noisy Siu ¥
|
|
|
|
pes Oat Sous Snes tee j Figure P5.14
(a) Assume two input sequences
» nd Pie ere: oy
ee
and initial states s, and s;. Show that, for the maximum likelihood decision rule, the two-signal error
probability is bounded as
P, (x, x’|s,,8:)< 3. >. / pl]; 8)Ply.|x,. 8.)
y n=1
= Yo EV Pvalxn» Sa)P(Vn| Xn > Sa)
n=1 yp
(b) Select the components of x and x’ independently according to the probability distribution
q(x), x € #. Then show that
Pz(x,x'|susi)< 2 VY YL Dd IT alxe)atxa)(E /P(V|Xn> Sn)P(Y [Xn 5)
By Bi Xs Be’ Xn Xn’ n=1
(c) Define the “super state”
S, a (S,.S,) € {A,, A,, coo Axaus-w} =A
348 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
where K = |2| is the number of channel input letters and A are the K?“~ !) distinct “super states.”
Defining x = (x, x’), q(x) = q(x)q(x’), and super-state transition expression §,,; = g(X;,, §;), show that
Pex x8) << DEY [a6 aE VPs. 53). 53]
Ry Ro RN n=1
=[1 1+ 1] ANi@8)
where i(s ,) is the (K?“ ~ '))-dimensional column vector with “ 1” in the position corresponding to state
s, and “0” elsewhere, and where
A= {a;;}
is the K?~“~) x K?(¥—») matrix with
ae | f(x, Aj) if A; = g(x, A;) for some & = (x, x’)
ea otherwise
and
F(% 8) = a00(E VeCoT s)oO1x.8)]
(d) Verify that Theorem 5.8.1 generalizes to this general finite memory channel.
CHAPTER
SIX
SEQUENTIAL DECODING OF
CONVOLUTIONAL CODES
6.1 FUNDAMENTALS AND A BASIC STACK ALGORITHM
In the last two chapters, we described and analyzed maximum likelihood decoders
for convolutional codes. While their performance is significantly superior to that
of maximum likelihood decoders for block codes, they suffer from the same disad-
vantage that computational complexity grows exponentially with constraint
length. Thus, even though error probability decreases exponentially with the same
parameter, the net effect is that error probability decreases only algebraically with
computational complexity. The same is true for block coding, but of course the
rate of decrease is much greater with convolutional codes.
This situation could be improved if there were a way to avoid computing the
likelihood, or metric, of every path in the trellis and concentrate only on those
with higher metrics which presumably should include the correct path.’ It is
practically intuitive, based on our previous analyses, that while an incorrect path
is unmerged from the correct path, its metric increments are much lower than
those of the correct path over this segment. We can support this observation
quantitatively by again considering the ensemble of all possible convolutional
codes of a given constraint length for a given channel. Let x and x’ be the code-
vectors for the correct and an incorrect path over a segment where the two are
unmerged, and let y be the received output vector from the memoryless channel
‘ An extension of a given path is regarded as another path.
349
350 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
over this segment. We now indicate the nth symbol of each vector by the subscript
n. Suppose we arbitrarily choose for our metric
M(x) = d m(x,,) (6.1.1)
where
= jp [POnl Xn) | _
m(x,) = 1 iS R (6.1.2)
and where we define
P(Yn) = DX, 4(%n)P(Vn| Xn) (6.1.3)
and q(x) is the arbitrary weighting distribution imposed on the code ensemble
(see Sec. 5.1}.
We note, first of all, that this choice of metric is consistent with the maximum
likelihood metric used previously. For, in maximum likelihood decoding, only the
difference between the metrics of the paths being compared is utilized. Thus, as we
previously defined in (4.4.1) and (5.1.1), the metric difference is
n
= ry hae
AM(x, x’) = M(x) — M(x’) = > In pas (6.1.4)
where the sum is over the symbols in the unmerged span. Consequently, the terms
p(y,) and R do not appear in the metric difference, and hence are immaterial in
maximum likelihood decoding. On the other hand, in any algorithm which does
not inspect every possible path in making a decision but must choose among
paths of different lengths, these terms introduce a bias which is critical in optimiz-
ing the performance of the algorithm.” To illustrate the effect of these terms,
consider the average metric increase for any symbol of the correct path. As usual,
we take both the expectation with respect to the channel output conditional
distribution p(y, |x,) and the ensemble average with respect to the input weighting
distribution q(x,). Thus we have
E,,, yal!(Xn)] = )) Q(Xn) Y) P(Yn| Xn)m(X,)
. : ein [Pal Xn) | _
= a d q(Xn)P(Vn| Xn) eee R|
= I(q)—R (6.1.5)
and, if we choose the weighting vector q to maximize I(q) and thus make it equal
to channel capacity
E.. , jm(x,)} = C= R >0: > forall R <C (6.1.6)
2 Massey [1972] has given analytical justification that the metric (6.1.2) is the optimum decoding
metric. This metric was first introduced by Fano [1963] and is referred to as the Fano metric (see
Prob. 6.7).
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 351
On the other hand, for any symbol on an unmerged incorrect path
Ex,, xin val™(Xn)] = 4(Xn) DL DL PUn| Xn) a(%n)n(X>)
yn Xn
=y2d ats) Pele = R|
= + P(Yn)
o P(YnlXn) |-
< d d q(X;)P(Vn) rae i —R
where we have used (6.1.3) and the inequality In x < x — 1. Then since the sum-
mation in the last inequality is identically zero we have
E [m(x,,)] < —R (6.1.7)
The reason that we had to average over the weighting of x,,, the corresponding
symbol of the correct path, is that the distribution of the channel output y, is
conditioned on it.
Thus, we have the heuristic result that the “average” metric® increment per
symbol of the correct path is always positive for R < C, while on an unmerged
incorrect path it is always negative. Obviously, any bias term less than C could be
used in place of R, but this choice minimizes the computational complexity. The
main conclusion to be drawn from this is that, on a long constraint length con-
volutional code, it should be possible to search out the correct path, since only its
metric will rise on the average, while that of any unmerged incorrect path will fall
on the average. By making the constraint length K sufficiently long, the fall in
any unmerged span can be detected and the path discarded, usually soon after
diverging.
Before we can substantiate these heuristic generalities, we must describe an
algorithm which somehow recognizes and utilizes these properties. We begin by
defining a sequential decoding algorithm as an algorithm which computes the
metric of paths by extending, by one branch only, a path which has already been
examined, and which bases the decision on which path to extend only on the
metrics of already examined paths.
Probably the most basic algorithm in this class, and certainly the simplest to
describe, is the stack sequential decoding algorithm whose flowchart is shown in
Table 6.1 for a rate b/n convolutional code. We adopt the notation j(u, w) for the
branch metrics, which consist of the sum of n symbol metrics and depend on the b
data symbols w of the given branch as well as on the (K — 1)b preceding data
symt-dls u of the path which determine the state of the node. Thus the algorithm
creates a stack of already searched paths of varying lengths, ordered according to
their metric values. At each step, the path at the top of the stack is replaced by its
Xn» Xn» Yn
3 This average, of course, is over the ensemble of codes defined by the arbitrary weighting distribu-
tion q(x). From this we can not necessarily conclude at this point that the same will be true for a
particular code. To deduce this from the ensemble average can only be considered a heuristic
argument.
352 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Table 6.1 Stack algorithm flow chart
Initialize by placing initial node
with 0 metric in stack
_—
ee
\
Replace the top path u and its metric
by its 2? successors with
augmented metric*
Muy = My a Muw
00..0
OO u24
Cae
Vv
w-"7
If any of the 2? newly added paths
merges with path already in stack,
eliminate the one with lower metric
'
Reorder stack
according to metric values
Is node at top of the stack
at the end of trellis?
Output path for top node
stop
* The metric subscripts u and w indicate data vectors; in the next section we shall identify
metrics by their code vectors x used as arguments of M(-).
2° successors extended by one branch, with correspondingly augmented metrics. If
any one of the newly added paths merge with any other trellis path already in the
stack, the one with lower metric is eliminated. The algorithm continues in this way
until the end of the trellis is reached.
An example of the basic stack algorithm search is illustrated in Fig. 6.1 which
shows the tree and path metrics for the K = 3, r = 4, convolutional code, first
studied in Chap. 4 (Fig. 4.2), transmitted over a BSC with p = 0.03. To determine
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 353
Information 1 0 ] 0 0
Transmitted sequence 11 10 00 10 11
Received sequence 01 10 Ol i: ieee ie
00 00
Pg coomeeee HL
10
11
00 aa
-18 10
11 a
01
00 ey
~9 11
10 aa Ole
10
00
01
1]
; -18 01 =
A
01
10
10 Fig
00
0 __ 00 ll
11 +25 10
-16}] 11
Y ever Lie
10 25
oe
i 008-12
ce gre 34
161.01 10
1] =36 00
— 11
01 ae
01 = a
11
~29 01 00
10
7 01
10
Stack contents after reordering
(path followed by metric)
1 0. +97 55-9
2 Aj 920); S185 OF IS
3-.10,,-7:.00,. 18; 01, —18;:14, -29
4 100, —16; 101, —16; 00, —18; 0), —18;.11, -29
S* 10) +16: 06.18: 01; 18-1000; 25; 1001, 25: 11, -29
6:/ 1010; +14: 00, —18: 01, —318;.1000, —25; 1001, -—25;: 11, -29;
1011, -36
7 30100; —12: 00, -18; 01,—18; 1006, —25; 1001, —25;
11, —29; 10101, —34; 1011, ~36
Figure 6.1 Stack algorithm decoding example.
354 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
the symbol metrics, we note first that R = r In 2 = 0.347 and p(y,) = 4 for y, = 0
and 1. Thus from (6.1.2) we have
m(x,,) = fin [2(1 — p)]} -R=0.316 forx,=y,
" \In 2p) — R = —3.160 for X, # Yp
Since the order of the search is unaffected if the symbol metrics are all multiplied
by the same positive constant, we may equally use
eel lor’ + = y;
~\~10 for x, # y,
and thus simplify the bookkeeping and the diagram. Figure 6.1 shows the first seven
steps of the search, indicating the path and metric values after each new pair of
branches have been searched and the stack reordered. Since no two equal length
paths with the same terminal state appear in the stack shown through the seventh
step, no eliminations occur due to merging up to this point. If, for example, the
correct data sequence had been 10100 so that the received code sequence
contains two errors, in the first and third branches (underlined), then it appears
that by the fifth step the correct path has reached the top of the stack and remains
there at least through the seventh step. Assuming that, of the paths shown in
Fig. 6.1, no path other than the top path is further extended (which would cer-
tainly be the case if no further errors occurred), we see that, from the third node
(where the trellis reaches its full size) through the sixth node, only eight branch
metric computations were required by this sequential stack algorithm as
compared to the 24 computations required in maximum likelihood decoding.
Obviously this comparison becomes ever more impressive as the code constraint
length grows.
Nevertheless, each step of the sequential stack algorithm does not necessarily
advance the search by one branch. It is clear from the example that the number of
incorrect paths searched varies from node to node. At each node of the correct
path, we define the incorrect subset to be the set of all paths which diverge from the
correct path at this node. For a rate 1/n code, exactly half the paths emanating
from a given node are in its incorrect subset. In the example, assuming again that
no further search occurs within the first six branching levels, we see that three
paths were searched in the incorrect subset of the first node, one path in that of the
second node, and three in that of the third node. Let us define a branch computa-
tion as the calculation of the metric of a single path by extension of one branch of
a previously examined path. Thus the number of branch computations per node
level is just one more than the number of branch computations in the incorrect
subset of that node. As shown more generally in Fig. 6.2, there are C ; paths
(branch computations) in the incorrect subset of node j, and hence C; + 1 compu-
tations required ultimately to reach node level j + 1 from node j without ever
again retreating.* Clearly C; is a random variable, but, as we shall see in the next
m(X,)
* Note that the jth incorrect subset may be revisited at any later time, but we take C ; as the total
number of branch metrics computed in this subset over all visits.
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 355
Incorrect subset 2’ (1)
C, paths
on™
Correct path
Incorrect subset 9’ (3)
C3 paths
~ =
~—e ae
Incorrect subset 2) Figure 6.2 Incorrect subsets for first
C, paths three nodes.
section, its distribution is independent of the constraint length of the code,
although it does depend on the rate R. Equally important is the fact that, even
though this algorithm is suboptimal, asymptotically for large constraint length it
performs essentially as well as maximum likelihood decoding.
We examine the distribution of the number of computations in Sec. 6.2 and
the error probability in Sec. 6.3.
6.2 DISTRIBUTION OF COMPUTATIONS: UPPER BOUND
Let x be the correct path through the trellis and let x; be any incorrect path which
diverges from x at node j; that is, x; is a path in the incorrect subset of node j.
Further, let M[x(i)] be the metric up to node i of the correct path, and let M[x‘(k)]
be the metric at node k of x; where k > j. The number of computations in the jth
incorrect subset will depend on the relative values of the metrics M[x;(k)] for all
incorrect paths in the subset and on M[x(i)] where both k >j andi > j. Precisely,
we have the following condition:
356 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Lemma 6.2.1 The incorrect path x; in the jth incorrect subset may have to be
searched beyond node k > j only if
M[x(k)] => min M[x(i)] = y; (6.2.1)
izj
ProoF A path is searched further if and only if it reaches the top of the stack.
We may assume node j on the correct path has been reached; otherwise the
incorrect subset for node j will be empty. The algorithm guarantees that, if
M[xi(k)] < M[x(j)], then this incorrect path x; cannot be searched further
until after x has been searched to a point at which its metric falls below
M[x;(k)], and hence its position in the stack falls below that of x;. But if
min M[x(i)] > M[x;(k)]
> J
then this never happens and consequently the incorrect path in question is
never searched again, which proves the lemma. Note that we have ignored
mergers, but since Lemma 6.2.1 is only a necessary, and not a sufficient,
condition for further search, the side condition which causes the pruning of
merging paths can be ignored.
This lemma is all that we need to determine the upper bound on the distribu-
tion of computation in the jth incorrect subset, which we henceforth denote 2"()).
We note first that the number of computations C; in this subset will exceed L only
if L paths in 2"(j) satisfy condition (6.2.1). Hence
Pr (C,>L} <Y ply|x)o,(L) (6.2.2)
where the received code vector> runs over all symbols beyond node j, and
1 if M[xj(k)] => y; for at least L paths xj(k) € 2’(j)
,(L) = (6.23)
0 otherwise
We proceed to upper-bound (6.2.3) by noting that if, for a given y, @/(L) = 1, then
by definition
M[x;(k)] —y; =>90 for at least L paths x;(k) € 2”(j)
and consequently
eA for at least L paths x‘(k) € 2’(j), a > 0
and is nonnegative for all other paths. Thus, summing over all paths in the
incorrect subset, we obtain that for any y for which ¢,(L) = 1
eMIx(ol-7} > TD for anya>O0
x(k) € XC)
° Notation and discussion is simplified if we do not specify the dimensions of vectors; these are
either implicit or specifically designated after each equation.
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 357
Equivalently,
p
e y etMixOl-7)}]! S1= 9g (L) foranya>Oandp>0 (6.2.4)
L x'{k) € X'(j)
The inequality (6.2.4) also holds trivially (as a direct inequality without the
intermediate unity term) for y such that ¢,(L) = 0. Also, from the definition (6.2.1)
of y;, it follows that
e77P?j — exp \-3p oF M{x()]|
and hence
en 2P < Fe eM ati (6.2.5)
i=j
Thus combining (6.2.4) and (6.2.5), we have
p @
jo y aca > e7 2PMIx(i)] > ,(L) (6.2.6)
x ;(k) e X'(j) i=j
for all y and any «> 0, p >0. Substituting into (6.2.2) we obtain
Lemma 6.2.2 The distribution of computation in the jth incorrect subset is
upper-bounded by
. ; ee >0
Pr {C, dee dL? Dlr ix 9 2eMIxia] | Mixon a
7 - A 2 ae Lea ie |
(6.2.7)
Note, of course, that the metrics M[ - |, as defined by (6.1.1) and (6.1.2),
are functions of y as well as of x or x’.
To proceed, we again consider the ensemble of time-varying convolutional
codes, first described and used in Sec. 5.1. Averaging over this ensemble, and
arguing just as in (5.1.14) and (5.1.15) by restricting p to the unit interval and using
the Jensen inequality, we obtain®
ns | a>0O
Pr 1 oF a L} <i7e Ply x) e~ 2eMIx(i)] SEMI
: ~ | X pea O<p<i
and where the first and second overbars on the right side indicate averages with
respect to the weighting distributions q[x(i)] and q[x‘(k)], respectively. Finally,
© Note that in taking this ensemble average, we are again ignoring possible merging of the correct
and incorrect paths. But, if merging occurs, we would not need to make any further computations on
the incorrect path in question; thus ignoring merging merely adds additional terms to the upper
bound, which is therefore still valid.
358 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
recognizing that in a rate b/n code, ignoring mergers, there are less than 2?" /)
paths xj(k) for k > j and that all averages are the same, we obtain, with the aid of
inequality (g) of App. 3A
Pr iC, > Dix >», Ply | x) y e 2PMIx(i)] 3 bk Def paMlx {kN \p
; ‘me P
i=j k=j
To simplify the notation, we let t = i — j and t = k — j and summarize the above
results as
Lemma 6.2.3 The ensemble average computational distribution in the jth
incorrect subset is upper-bounded by
Pr cc fae 7. T(t, T) (6.2.8)
t=0 t=0
where
a>0O
T(t, = 2btp x(t x(t — apM[x(t)] x eM cyl”
X & plylx( Nalx(t)Je dL alx'(c 7) oe
(6.2.9)
and where x(t) and x’(t) are codeword segments of t and t branches,
respectively.
This bound is clearly independent of j. To evaluate (6.2.9), it is necessary to
distinguish the cases t < t and t > t. As shown in Fig. 6.3, the former case corre-
sponds to the case where the correct path segment under consideration is longer
than the incorrect, and vice versa for the latter. Then, since the channel is mem-
oryless, it follows from the definitions (6.1.1) through (6.1.3) that, for t<t
Tig t= 2 ef eee ae (6.2.10a)
where
eB = SY gly) Poles) (6.2.11)
eter = ST gsdotyla)|E aey(BPU) (62.02)
ey p(y |x)
while for t >t
T(t, 1) = 20% @~ En ee Om g—tnEcr(a, p) (6.2.10b)
where
oF) = Sy) lS ae) Pee p(y|x’) ,- ay (6.2.13)
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 359
Correct path
Incorrect path
Node
ee
iN ERG ent oy Sea at
(a)t>T
Correct path
Incorrect path
|
|
|
l____ Node
Figure 6.3 Relative node depths for
(b)7>t Eq. (6.2.9).
It should be clear that the single subscripts C and J correspond to segments which
contain only the correct or incorrect path branches, respectively, while the double
subscript CI corresponds to segments which contain branches of both the correct
and incorrect paths (see Fig. 6.3).
Thus, since R = (b/n) In 2, we may rewrite T(t, t) as
T(t, t) = (exp [—ni(t — t)Ec(a, p) + t[Eci(%, p) — pR}}] t<t
7 (exp [—ni(c — t)[E,(x, p) — pR] + tlEci(%, p) — pR]}] tot
(6.2.14)
Finally, applying the Holder inequality (App. 3A) to each component of the expo-
nents, using the definitions (6.2.11) through (6.2.13), we find
e Ect, p) < etPR— (1 — ap)Eolap/(1 — ap)] = dc (6.2.15)
e~ (Ena, p)- PRY < gA(1—a)R- apEdl(1—a)/a] = 5 (6.2.16)
e7 lEci(a, p)— PRI < geR-(1—ap)Edlap/(1—ap))— 2pEol(1 —a)/a] — § 5, (6.2.17)
360 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
where 0 < ap < 1 and where E,(p) is the Gallager function’ of (3.1.18). Thus
T(t, t) <6¢ Of ~=~—s for allt > 0, 7 >0 (6.2.18)
Now in order for the double summation (6.2.8) to be bounded, we must have
dc < 1 and 6, < 1; but according to (6.2.15) and (6.2.16)
dc <1
— E
| eae a cei E, mat or r= ol?) where y = AB 350
ap 1 —ap y 1 —ap
6; <1
fo Re E,(=*) or R < Ed) EE 0 a >0
l-—«a ow ou
Thus for « = 1/(1 + p), both conditions reduce to
E
dc < 1,6,<1 if R < Pale) 0<p<1 (6.2.19)
Finally, choosing p such that
E,(p)
p
we have from (6.2.8), (6.2.18), and (6.2.19)
Pric-> t= Ape
R=(1-e) e>0
where
1
A= O<p-<1
(1 — o2)(1 — 53) ‘
Thus, we may conclude with
Theorem 6.2.1 There exists a time-varying, rate b/n, convolutional code
whose distribution of computation in any incorrect subset (and hence of
computation required to advance one branch) is bounded by
Pico lL oA Veg) (6.2.20)
where A is a constant and op is related to the rate R = (b/n) In 2 by the
parametric equation
Ry < 1 te & (6.2.21)
rare. 2
’ Maximization of the exponent with respect to q(x) is implied here.
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 361
and ¢€ is any positive constant. The distribution described in (6.2.20) is called a
Pareto distribution. Note that the power p goes to unity as R > Ry and to zero
as RC when we let « > 0.
Obviously, the condition (6.2.19) also yields an upper bound for lower
rates (R < Ry). We may take p = 1 so that
A
Pr {€ > L} < ies: (6.2.22)
However, one would expect a more rapid decrease with L for low rates, and in
fact, if we remove the linearity condition on our code, a tighter result can be
proved. Precisely, for a time-varying trellis code,® it can be shown (Savage
[1966]) that
Paf{Czij< AB? .0<p <0 (6.2.23)
where
E,(p)
R=(1-e) 0<R<C(l-€6)
We shall show in Sec. 6.4 that this is the best possible computational
distribution by deriving a lower bound for any sequential decoding algorithm.
But before we do this, in Sec. 6.3 we upper-bound the error probability for
this algorithm, to show that it is asymptotically optimum for large K.
6.3 ERROR PROBABILITY UPPER BOUND
The calculation of an upper bound on node or bit error probability for sequential
decoding is almost the same as that for the distribution of computations. We now
concentrate on the merging of paths, but rather than consider the probability that
an incorrect path metric exceeds the correct path metric upon merging, we recog-
nize that an incorrect path in the jth incorrect subset does not even get a chance to
reach the merging point if its metric at the merging point is below the minimum
metric of the correct path after node j. That is, consider the incorrect path x‘(k)
which diverges from the correct path at node j and remerges at node k. If
M[x;(k)] < min M[x(i)]
al
then the incorrect path does not even get a chance to be compared with the correct
path at this point.? Alternatively, we can state this in the same form as Lemma
O24;
® A general time-varying trellis code of constraint length K can be generated by the same K-stage
shift register(s) as a convolutional code, but with time-varying arbitrary logic (“ and” and “ or” gates)
in place of linear logic (modulo-2 addition).
° This implies also that the step in the Stack algorithm (Table 6.1) which eliminates merging
paths may be omitted. (See Sec. 6.5.)
362 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Lemma 6.3.1 An error may be caused by selecting an incorrect path x‘(k)
which diverged from the correct path at node j and remerged with it at node k
only if
M[xi(k)] = min M[x(i)] = y; (6.3.1)
i>j
Again this condition is necessary, but clearly not sufficient for an error to
occur.
From this point, much of the derivation closely follows that of the previous
section. There is, however, one important difference. While x(k) in Sec. 6.2 repre-
sented any path in the jth incorrect subset, it now represents only such a path
which merges at node k. Thus the steps leading to Lemma 6.2.2 are essentially the
same, as is the lemma itself, but now the set 2’(j) must be replaced by the union of
subsets |) j4« 2’(j; k) < 2'(j) where 2’(j; k) contains all paths in 2’(j) which
remerge with the correct path at node k. We can thus prove
Lemma 6.3.2 The node error probability at node j is upper-bounded by
<2 ply |x) pics ate y | ‘3 7 .
k=j+K \x'(kye2'(j; k) 0<p<l
(6.3.2)
and the expected number of bit errors caused by a path which diverged at
node j is upper-bounded by
<2, ply|x) be geet) SS. ER cod = (Kero
k=jt+kK
4 emixyen\ > 0
(63.3
lx (hy BG; k) O0< ps 1 )
ProoF Let P,(j; k) be the probability of an error at node j caused by a path
which remerges at node k. Analogously to (6.2.2), we have, using Lemma 6.3.1
k) <¥. ply|x)oy(1) (63.4)
where recall from (6.2.3) that
g,y(1) = | I if M[x;(k)] Pg Yj for some xi(k) ie #1; k)
io otherwise
Thus if for a given y, ,(1) — 1, then
etMIxi(—7j} > ]
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 363
for some x(k) € 2’(j; k). Hence for this y
j e a>0
fe eet = oii 6.3.5
x(k) €X'(j; k) o(1) p>0 ( )
while if @,(1) = 0, (6.3.5) holds trivially without the intermediate unity term.
At the same time, e *°’) may be bounded just as in (6.2.5). Thus substituting
(6.3.5) for @,(1) and (6.2.5) for e~*””) yields
PN = py ine a yaaa 638)
“J x i(k) € B'(j; k)
It takes at least K branches for a path to remerge; thus
PAj)< ), Pi; k) (6.3.7)
k=j+K
Combining (6.3.6) and (6.3.7) yields (6.3.2). To find the expected number
of bit errors caused by such a node error, we observe, as in Sec. 5.1, that the
number of bit errors caused by an incorrect path unmerged for k — j branches
cannot be greater than b[k — j — (K — 1)] [since the last K — 1 branches, or
b(K — 1) symbols, must be the same as for the correct path]. Thus
Elm (i) < ¥, bk —7- (K — UIP; k) (6.38)
Vee =j+K
Combining (6.3.6) and (6.3.8) yields (6.3.3), and thus proves the lemma.
If we now proceed as in Sec. 6.2, by averaging (6.3.3) over the same code
ensemble, restricting p to the unit interval and applying the Jensen inequality, we
obtain
E{n,(i)] < ¥. ply |x) Pea y kj K+ 1)
y k=j+K
Sag ee ene
lea) E24; k) O<p<l
But the set 2’(j; k) of incorrect paths diverging at node j and remerging at node k
contains no more than (2? — 1)2°"*~)~* paths, since the first branch must differ
from the correct path while, of the remaining (k —j — 1) branches, the last
(K — 1) branches must be identical to it. Thus, since the same weighting distribu-
tion is used for all path branches
E{n,(j)] ay p (y|x) J e~ #0MIxt)
ise |
x y b(k —j- aes 1)(2° aes 1)P2G— A KlofoaMixj@Np
k=j+K
364 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Finally, since by (5.1.32)
ae —
ne
letting t = i — j and t = k — j, we have
Lemma 6.3.3 The ensemble average bit error probability is upper-bounded
by
P,<(2?—1p27->**S ¥ [rt —(K — 1)] T(t, t) (6.3.10)
t=O t=K
where
T(t, t) = 2°” YY) p(y| x)a(xye- 27M= » dx (cyjemeonl 29
y: = ‘(t) 0< ps 1
(6.2.9)
But T(t, t) is identical to the function defined in Sec. 6.2, and we have shown
there that
T(t,t) < 6¢67,— for allt > 0,7 >0 (6.2.18)
with
1
6c < 1,6, < 1 iG and R < =a) (6.2.19)
Thus letting R = (1 — €)E,(p)/p, we have that the sum of (6.3.10) is bounded by
(2° — 1)6*" Bote O0<p<l
Py < (1 — 62)(1 — 67)? R,(l-«6)<R<C(l-—@
(6.3.11)
For R < Ro, as usual we choose p = 1. Thus, using the terminology of Sec. 5.1
(5.1.34)
ER) = E,(p)
(1 5 e)E,(p)
p
and taking ¢ = |In 6,|/E,(R), we have the following theorem.
oa
Theorem 6.3.1: Error probability with sequential decoding (Yudkin
[1964]) The ensemble average bit error probability of a sequentially decoded
time-varying convolutional code of rate b/n is upper-bounded by
Se?) San 0<R<R,.(1-€)
P< 6.3.12
° | D2~ KOEARIIR R.(i-—«)<R<C(l—e) (
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 365
where E,(R) is given by (5.1.34) and’°®
2>-1 2>—1
D = (1 Bie — ae BI ae 9 raid 08 ™ [1 4 ¥ ateietles
(6.3.13)
The exponent of (6.3.12) has the same form as that of (5.1.32), the upper
bound for Viterbi decoding at high rates'’ [except that ¢ here is related to 6,
whereas in (5.1.32) it is an arbitrary positive number]. On the other hand, for
lower rates R < Ry, the exponent is reduced from Ro/R > | to unity. It would be
possible to increase this exponent, as well as that of (6.2.22), by a different choice
of bias term [R replaced by Ro(i — €)| in the metric (6.1.2), but only at the cost of a
worse distribution of computation at higher rates (see Prob. 6.2).
There remains one issue to resolve. Although we proved in Sec. 6.2 that there
exists a code for which Pr (C > L) < AL °, and although it follows from Theorem
6.3.1 that there exists a code for which P, < P, is bounded by (6.3.12), these
bounds may not both hold simultaneously for the same code. The resolution of
this dilemma is arrived at by an argument similar to that used in Sec. 3.2.
Assuming, for the moment, a uniform weighting of the ensemble, there exist « and
B on the unit interval such that all but a fraction « of the codes satisfy
Pr {C21} <“L* (6.3.14)
while all but a fraction f of the codes satisfy
P, < P,/B (6.3.15)
Thus, at most a fraction a + f fail to satisfy at least one of these bounds, and con-
sequently a fraction (1 — a — B) must satisfy both. With nonuniform ensemble
weighting, an essentially probabilistic statement must replace this simple argu-
ment. In any case, there exists at least one code which, within unimportant multi-
plicative constants, simultaneously satisfies the upper bounds of both Theorems
6.2.1 and 6.3.1.
6.4 DISTRIBUTION OF COMPUTATION: LOWER BOUND
We now proceed to show that the upper bound of Theorem 6.2.1 for convolu-
tional codes is asymptotically tight at least for R > Ry, and that the result (6.2.23)
for trellis codes is asymptotically tight for R < Rp as well. The proof is based on
comparing the list of paths searched by a sequential decoder with the list of the
L paths of highest metric for a fixed block decoder, and employing the lower
bound of Lemma 3.8.1 on list-of-L block decoding.
‘© The last inequality in (6.3.13) follows from the choice of € and substitution of (6.2.19) in
(6.2.15) and (6.2.16).
11 For systematic codes, the exponent is reduced by the factor of 1 — r = 1 — R/In 2. This is shown
by applying the same arguments as used in Sec. 5.7.
366 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
We begin by considering a sequential decoder aided by a benevolent genie
who oversees the decoder action on each incorrect subset. If any incorrect path of
the jth incorrect subset is searched to length | branches (N = In symbols) beyond
node j, the genie stops the decoder and informs it to stop searching this path.
Provided no decoding error is made at the jth node, the distribution of computa-
tion on the jth incorrect subset Pr {C; > L} is lower-bounded by the probability
that the genie stops the decoder L times. For first of all, a computation has been
defined as a branch computation, and L is just the number of computations on the
last branch of all the incorrect paths stopped by the genie. Hence we are ignoring
all but the last branch of each path in computing the lower bound. Furthermore,
many other paths may have been searched, but not to depth / branches. Finally, if
the genie were not present, the incorrect paths might be allowed to continue for
even more operations; but if no errors are made in the jth subset, we will ulti-
mately return, just as if the genie were present. Thus in the absence of errors
Pr {€, 21> Pb) (6.4.1)
where
ies | genie stops decoder at depth / of jth incorrect |
= 6.4.2
PAL) |subset more than L times | cr
Naturally, when the correct path arrives at depth | beyond node j, it is allowed
through. Thus suppose we construct a list 2,(j) of the first L paths (incorrect or
correct) emanating from node j of the correct path and examined by the genie at
node j + |. Then letting 2, (j) be the complementary set of 2°‘ — L paths not on
this list, the probability that the genie stops the decoder more than L times for a
given received vector y is
P,(L|y) = Pr {x € ¥,(/)|y} (6.4.3)
where x is the correct path over the given /n-symbol segment. Then, since all 2”
paths of this length emanating from a common node j are a priori equiprobable,
it follows that
1
PALY=Y D_ spiwly |x) (6.4.4)
y xeX,(j)
This procedure should remind us of the list-of-L decoder described in Sec. 3.8
and Prob. 3.16. A maximum likelihood list-of-L decoder for a code of M code-
vectors produces a list consisting of the L code-vectors with highest likelihood
functions (metrics). Suppose the number of code vectors is M = 2°". Let the list of
the L most likely code-vectors be denoted #(L) and the complementary set,
consisting of the 2’! — L code-vectors not on the list, be denoted 7(L). Then for
any block code of 2” a priori equiprobable code vectors of length N, the block
error probability of such a decoder is
PALY=Y YL su Pvly|x) (6.45)
y xeXL)
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 367
Now the genie-aided sequential decoding of all paths of length N symbols emanat-
ing from the jth node can also be regarded as a decoding operation on a somewnat
constrained (truncated convolutional) block code of 2” vectors. However, while it
does produce a list-of-L output, this list does not correspond to maximum likeli-
hood decoding. Thus it follows that for every x, € 2,(j), there is some x, € W(L)
such that
Xq € ¥,(j)
Pw(¥|Xa) < Px(¥|Xs) x, € &L)
Since UL) + UL) = X,(j) + %_(j) it follows that if, for a given y, we sum the
2°' _ [, elements of the complementary sets, then
» prly|x)=> Y py(y|x) (6.4.6)
xe 2,(j) x € ML)
since W@(L) consists of the (2°' — L) vectors with lowest likelihood, while 2, (/)
may have some elements which are contained in W(L).
Finally combining (6.4.1) and (6.4.4) and employing (6.4.6) to compare this
with (6.4.5), we have
PriC, >) > PAL) =e 2 py |x)
y xe21(j)
=o >» 27 "py(y |x)
y xeWL)
= P,(L) (6.4.7)
At this point we may use Lemma 3.8.1 which lower-bounds the list decoding
error probability P,(L) = P,(N, 2°, L) to obtain
Pri 2h) > glee (3.8.2)
where
eae | s Fags
K= au ) N =nl (3.8.3)
and
E.,(R) = E,(p) — pE,(p) O<p<o
i (3.6.46)
R = E((p) OA R<C
To utilize this result, we must choose / or N = nl, the genie’s vantage point.
Suppose we arbitrarily pick‘?
pinL
nl = nein = ; (6.4.8)
' E,(p) — pE,(p)
‘* The connection between /,,,, and k,,;, of (5.5.5) is noteworthy (see Prob. 6.5).
368 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
Combining (3.8.2), (3.8.3), (3.6.46), and (6.4.8), we obtain!
Pr {C; => L} > e- nlcritLEo(p) — pEo'(p) + o(Icrit)]
— g~ Pin L{1 +o(L)]
= ili oa. Da p< (6.4.9)
To determine the relationships between rate and p, we have from (3.6.46), (3.8.3),
and (6.4.8)
In aaa In Er
E‘(p) = R= ~
o(p) Nl cit Nerit
E, ’
= R-— eee) s E.)|
p
Thus
E
Ra “ 0<p<a (6.4.10)
Hence we obtain the following theorem.
Theorem 6.4.1: Computational distribution lower bound (Jacobs and Berle-
kamp [1967]) The computational distribution of any convolutional (or
trellis) code, on any incorrect subset where no decoding error occurs, is lower-
bounded by
Pr {(C > L}. > L*[1 —o(L)] (6.4.11)
Ep) O<p<o
where R= - jor oe
(6.4.12)
Thus, by comparing Theorems 6.2.1 and 6.4.1, we see that the bounds in both
theorems are asymptotically tight for Ro < R < C. For lower rates 0 < R < Ro,
the lower bound (6.4.11) has been shown to be asymptotically tight only for
time-varying trellis (nonlinear convolutional) codes. For linear convolutional
codes, the lower-bound exponent of (6.4.11) does not agree for the lower rates
with the upper-bound exponent of (6.2.22). It is not known whether either bound
is tight.
Given the significance of this Pareto distribution for the operation of a se-
quential decoder, it is worthwhile to examine how the key parameter p, known as
the Pareto exponent, varies with the channel probability distribution for specific
commonly used channels. For the BSC derived from the binary-input AWGN
*S Here o(L) ~ 1/,/In L.
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 369
channel by hard quantization of the channel output (J = 2), the function E,(p),
first derived in Sec. 3.4, depends only on the symbol energy-to-noise density &,/N,
[see (3.4.1) with p given by (3.4.18)]. Then solving the parametric equation (6.4.12)
for p, with various values of code rate r = R/In 2, results in the curves shown in
Fig. 6.4, where p is plotted as a function of &,/N, = (6é,/N,)/r in dB.
Of considerable interest is the behavior of the decoder when soft (multilevel)
quantization is used on the AWGN channel output. Figure 6.5 shows the corre-
sponding results for the octal (3-bit) quantizer of Fig. 2.13 and the corresponding
channel of Fig. 2.14, with the quantization step a = 0.58,/N,/2. For this case,
E,(p) is obtained from the general expression (3.1.18) with the transition probabil-
ities given by (2.8.1). We note that the improvement over the hard-quantized case
is very nearly 2/2 (2 dB), the same improvement factor found for small &,/N, in
Sec. 3.4 (Fig. 3.8).
Note also that p = 1 corresponds to R = Ry = E,(1) = E(O), and thus the
intercepts of the line p = 1 for each curve in Figs. 6.4 and 6.5 can be derived from
the J=2 and J=8 curves of Fig. 3.8(b) by finding the point at which
E(Q)=:R = r in._2,
4.0 T T T T
3.5 -
3.0 a em 1/3 aa
r=1/2
= r=2/3
5 r= 3/4
& ZF ay 2
o
°
D
=
A.
2.0 4
1S 4
l | i |
mn 5 6 ,, 8 9
&,/No, dB
Figure 6.4 Pareto exponent versus &,/N, for an AWGN channel with hard quantization.
370 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
4.0 ] T T
3.5 + =
3.0 4
r= 1/3
Q
= r=1/2
oO
3 r= 2/3
c 25. +
o r= 3/4
°
D
5
a
Zao ro +
Lar _
1.0 | | | see
| 2 3 4 5 6
&,/No, dB
Figure 6.5 Pareto exponent versus &,/N, for an AWGN channel with octal quantization
(a = 0.58,/N,/2).
6.5 THE FANO ALGORITHM AND OTHER SEQUENTIAL
DECODING ALGORITHMS
The basic stack sequential decoding algorithm is a distillation and ultimate
simplification of a number of successively discovered algorithms, each of which
was progressively simpler to describe and analyze. The original sequential decod-
ing algorithm, proposed and analyzed by Wozencraft [1957], utilized a sequence
of progressively looser thresholds to eliminate all paths in the incorrect subset of
node j before proceeding to search the paths emanating from node j + 1. This
technique is mainly of historical importance. The next important step was the
algorithm described by Fano [1963], whose complete analysis appears in the work
of Yudkin [1964], Wozencraft and Jacobs [1965], and Gallager [1968]. From a
practical viewpoint, the Fano algorithm is still probably the most important and
will be discussed further below. Stack algorithms, which form the basis of the
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 371
algorithm treated in Secs. 6.1 to 6.4 were proposed and analyzed independently by
Zigangirov [1966] and Jelinek [1969a]. Also of some tutorial value is the semi-
sequential algorithm proposed by Viterbi [1967a] (Prob. 6.4) and extended by
Forney [1974].
The Zigangirov and Jelinek algorithms are most similar to the one considered
here. They differ, however, in certain features designed to render them more
practical for implementation. First, both ignore merging and thus make no provi-
sions for comparing the metric of a path newly added to the stack with that of a
previously inserted path of the same length terminating in the same state. But this
does not significantly increase the probability of error, for in Sec. 6.3, we upper-
bounded P, by determining the probability that an incorrect path was searched up
to the point of merging. This is tantamount to assuming that errors always occur if
an incorrect path is allowed to merge, and so the already calculated error bound is
valid even if comparisons of merging paths are not performed. As for the compu-
tational distribution, it is possible that, by not eliminating merging paths, more
computations are required, since excess (duplicate) paths are carried along in the
stack. But as we have just noted, the probability of this event is on the order of the
error probability, which decreases exponentially with constraint length, K.
Moreover, the computational distribution upper bound is independent of K,
which suggests that K can be made very large—much larger than for maximum
likelihood decoding, as we shall discuss further in the next section; in that case,
both error probability and the additional computation due to ignoring mergers
will be negligible. From a practical viewpoint, ignoring mergers is very useful, for
carrying out the merge-elimination step in the flowchart of Table 6.1 would con-
tribute heavily to the computation time for each branch.
A much more serious weakness of the basic stack algorithm is that the stack
size, and hence required memory, increment for node j is proportional to C,, the
number of computations in the incorrect subset, and hence it too is a Pareto
distributed random variable. In the Zigangirov algorithm, this drawback is par-
tially remedied by discarding a path from the stack whenever its metric falls more
than a fixed amount f below the metric of the top path. The probability of
eliminating the correct path in this way decreases exponentially with f, so that the
effect on performance can be made negligible.
A third, and possibly most undesirable, drawback of the basic stack algorithm
is that the stack must be reordered for each new entry, requiring potentially a very
large number of comparisons each time. The Jelinek algorithm partially avoids
this by ordering paths only grossly; that is, all paths with metrics in the range
M,, <M <M,,+ A are placed in the mth “bin” and paths in the top bin are
further searched in inverse order of their arrival in the bin (last-in first-out or
“push down ” stack). This requires then that any path not in the bin has its metric
compared only with one metric for all the paths in the bin. The effect of this gross
ordering is easily determined. The basic condition for further search of an incor-
rect path (6.2.1) is modified to become
M[x(k)|}=y,;-A k>j (6.5.1)
372 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
The remainder of the derivation of the computational distribution of Sec. 6.2
follows in exactly the same way, with the result that the additional factor e%°4 is
carried throughout. Since in the final steps we choose « = 1/(1 + p), the final effect
is to multiply the factor A of (6.2.20) by e4”"*, which obviously is asymptot-
ically insignificant. For the same reasons, the same factor also multiplies P, of
(6.3.12).
This brings us finally to the Fano algorithm, which is generally considered to
be the most practical to implement. It too utilizes a sequence of metric thresholds
spaced at intervals of A. Its most desirable feature is that it examines only one
path at a time, thus eliminating the storage of all but one path and its metric.
Basically, it continues to search further along a given path as long as its metric
is growing. Whenever the metric begins to decrease significantly, it backs up and
searches other paths stemming from previous nodes on the already travelled path.
It accomplishes this by varying a comparison threshold in steps of magnitude A.
The threshold is tightened (raised by A) whenever the metric is growing sufficient-
ly on a forward search and relaxed (lowered by A) during backward searches.
This is done in such a way that no node is ever searched forward twice with the
same threshold setting—on each successive forward search, the threshold must be
lower than when it was previously searched.
The details of the Fano algorithm can be explained by examining the
flowchart of Fig. 6.6. In the first block, looking forward on the better node of a
binary tree refers to computing both branch metrics and tentatively augmenting
the current node metric by the greater of the two branch metrics. If the better node
has just been searched and the running threshold, T, violated, the forward look
must be to the worse node. This will occur if the first block is entered from point
(A) , which corresponds to a single pass through the backward search. In either case,
the metric of the node arrived at is compared with T, and if it is satisfied (M > T),
the search pointer is moved forward to that node. The next test is to determine
whether this is the first time this node has been visited in the sequential decoding
search.'* It can be shown (Gallager [1968]) that if this is the case, the metric of the
preceding node will violate T + A. If so, we may attempt to tighten the threshold
by increasing T by integer multiples of A until M < T + A, and continue to look
forward. If the node has been searched before, it is essential that we not tighten the
threshold prior to searching further, for otherwise we may enter a closed loop and
repeat the same moves endlessly.
If, in the first block, upon forward search the new node has metric M < T, we
must enter the backward search mode. This involves subtracting the previous
branch metric from the current node metric. If this satisfies T, then the pointer is
moved back; if the branch upon which the backward move was made was the
better of the two emanating from the node just reached, the worse has yet to be
searched. Thus we return to the forward search via . If it was a worse branch,
there are no more branches to search forward from this node; hence we must
continue the backward search. If upon a backward look the current threshold T is
‘4 If the code tree is of finite length, this is also the point at which we should test for the end of the
tree tail and terminate when this is reached.
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 373
Initialize with threshold T = 0
>
y
Look forward to
better node T violated Look back | T violated
or ~ from
if entering via (A) to - search node
worse node
T sati
T satisfied satsind
y 7
Move Move
pointer pointer
forward back Y
Decrease
threshold
v v by A
No First visit Yes Did move
<_ to originate on
this node? worse node?
Yes No
v
Tighten
threshold if
possible (4)
a : Y y
(Forward search) (Backward search)
Figure 6.6 Fano sequential decoding algorithm for binary tree.
violated, we cannot move back. When this occurs, all paths accessible from here
with the current T in effect have been searched and found to eventually violate T.
The threshold is now decreased by A and forward search is again attempted. Note
that, when a node is searched two or more times, each successive time it will be
searched with a lower current threshold; hence endless loops are avoided.
Extensive treatments of the Fano algorithm are contained in Wozencraft and
Jacobs [1965] and Gallager [1968]. Its performance, as well as its analysis, is
essentially the same as that of the stack algorithm. In fact, the threshold increment
A of the Fano algorithm has exactly the same effect as the bin size A of the Jelinek
algorithm. Geist [1973] has shown that, under some weak conditions, the Fano
algorithm always finds the same path through the tree as the stack algorithm. The
only difference is that in the Fano algorithm a path node may be searched several
times, while in the stack it needs to be searched only once. The effect is a modest
increase in number of branch computations, which can be accounted for by an
additional multiplicative factor in Theorem 6.2.1. This disadvantage is usually
more than offset by the advantage of a considerable reduction in storage
requirements.
374 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
6.6 COMPLEXITY, BUFFER OVERFLOW, AND OTHER
SYSTEM CONSIDERATIONS
For maximum likelihood decoding of convolutional codes by the Viterbi algo-
rithm, complexity is easily defined, for if the constraint length is K and the code
rate is b/n, the number of branch metric computations per branch (b bits) is 2*?,
while the number of comparisons, and the number of storage registers required for
path memories and metrics, is 2~'. Thus, if we define complexity for Viterbi
decoding as the number of branch metric computations per bit, that is
DKb
air (6.6.1)
then it follows from (5.1.32) and (5.4.13) that the bit error probability is,
asymptotically’> for large y, and for Rj) < R<C
P, ek 2 — KbEA(R)/R ae yp BARE pi » ea (6.6.2)
where
BAR) EI eo Uap ed
ee | 3 ea Pee 6
We have already seen in Sec. 5.2 that convolutional code behavior as a function of
complexity is much more favorable than that of block codes. In the present
context, for a K-bit block code, we should define y = 2*/K, the number of code-
vector metric computations per bit. Then from (3.2.14) and (3.6.45), it follows that
the block error probability for E,(1) < R < C is asymptotically |
Py ~e7 NER = 2- KERR ~ y-ERIR EVI) R<C (6.6.3)
But since E(R) is significantly smaller than E,(R) for Ro < R < C, the magnitude
of the negative exponent of (6.6.2) is significantly greater than that of (6.6.3); hence
the superiority of convolutional codes with Viterbi decoding.
The definition of complexity for sequential decoding is somewhat less
obvious. One possible definition would be the maximum number of branch metric
computations per bit; that is, the maximum number of computations in the incor-
rect subset of each node, normalized by b, the number of bits decoded for each
node advanced. The problem is that this is a random variable, C/b, and for an
infinite length tree C has a Pareto distribution with no maximum. On the other
hand, for practical reasons discussed further below, we must limit the number of
15 In this asymptotic expression, we ignore all terms which do not depend on K; hence both the
multiplicative constant and ¢ are omitted.
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 375
computations in any given incorrect subset or we might never complete de-
coding.'® Thus if we require C < L,,,, for each incorrect subset, we may define
the complexity for sequential decoding as
es Dini
ee:
Then we have from (6.2.20) and (6.4.9) that the decoder will fail to decode a given
node, by virtue of requiring more computations for that node than are available,
with probability
(6.6.4)
Praise ioe | Pa! die xX 4 (6.6.5)
where
E{R) 0<p<l
R Re sk=C€
p=
Thus, interestingly enough, we note by comparing (6.6.5) with (6.6.2) that the
probability of sequential decoding failure, when the number of computations per
branch is limited, asymptotically bears the same relation to complexity as does the
bit error probability for Viterbi (maximum likelihood) decoding. Note, however,
that in sequential decoding, the constraint length K does not appear and it would
almost seem that, since complexity is independent of K, we should make this
arbitrarily large,'’ thus eliminating the possibility of ordinary error and replacing
it by the kind of decoding failure just described.
Comparison of (6.6.2) and (6.6.5), or of (6.6.1) and (6.6.4), suggests choosing
Linax {Or Sequential decoding such that
ENE
where K pertains to Viterbi decoding. Thus, the effective constraint length for
sequential decoding is
log Linax
Ker aT ‘
which measures the effective complexity of the algorithm in the same way as does
the ordinary constraint length in Viterbi decoding.
However, our definition of complexity for sequential decoding is somewhat
misleading, for it is based on the maximum number of computations per bit,
whereas normally, for most nodes, C ; will be much less than L,,,,. This scheme of
limiting C; is also impractical by itself since a decoding failure may well be
1© There is also the issue of the size of the stack, which grows with C for each node; however, by
using the Fano algorithm, all this storage is avoided at the cost of increased computation.
‘7 For other practical system considerations, discussed below, this is really neither feasible nor
desirable, but we might make K sufficiently large that P, is negligible compared to the probability of
failure as given by (6.6.5).
376 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
“catastrophic” in the sense’® that the decoder may possibly never recover from it
to return to correct decoding. These issues suggest that, for any real sequential
decoder, we need to provide some additional features and basic operational
techniques, before we can realistically consider its performance and complexity.
The best place to begin is to establish the size of the memory or buffer which
contains the symbols of the received code-vector y as they await their turn to be
searched by the decoder. Let this be fixed at B branches, or Bn channel symbols.
Assuming a J-ary output channel, or equivalently a J-level quantized continuous
channel, this will require at most Bn[log, J] bits of storage. (This does not include
any of the memory required by the stack but, as already noted, we can avoid this
altogether by using the Fano algorithm.)'? Next, suppose the decoder can per-
form p branch computations during the interarrival time between two successive
received branches—thus yp is the number of computations for every b bit times and
is generally called the decoder speed factor. Then, if the number of computations
required in the jth incorrect subset is
C,;> uB (6.6.6)
it is clear that a failure will occur. For, even if the buffer is empty when the channel
symbols of the jth branch are received, if C j > HB computations are required in
the jth subset, then clearly the received symbols for the jth branch cannot be
discarded for at least C ;/4 > B branch times, since we may need to use these to
compute a metric at any time until the jth branch decision is finally concluded.
But in this time B more branches will have arrived and require storage, which is
impossible unless the jth branch is discarded. This type of failure is called a buffer
overflow for obvious reasons, and it follows from (6.6.6) and the lower bound of
Theorem 6.4.1 that at the jth node
F Svecetin - (uB)°[1 st o(uB)| (6.6.7)
where
E,(p) 0<p<a
R=
p ie a <y
(6.6.8)
We also know from Theorem 6.2.1 that this result is asymptotically tight for
Ry < R<C, provided the buffer is assumed initially empty. Moreover, if we
widen our horizon to include arbitrary time-varying trellis (nonlinear convolu-
tional) codes, the result is asymptotically tight for all rates, according to (6.2.23),
as shown by Savage [1966]. Assuming this wider class of codes*® in the
18 This is to be distinguished from the definition of catastrophic codes given in Chap. 4.
19 Arguments in favor of the stack algorithm point out that, if enough storage is available, the stack
algorithm is preferable because of the reduced computational distribution (by a moderate factor,
independent of L). But the counterargument can be made that, if properly organized, this additional
stack memory can be devoted to the input data buffer in the Fano algorithm, which does not require
the stack, and that this advantage will more than overcome the required increase in computation by
significantly increasing B while only moderately increasing ¥ [see (6.6.9)].
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 377
following, we thus conclude that at the jth node with an initially empty buffer
t gabe Sah bys KH (uB)? (6.6.9)
where % is a constant. Experimental evidence (Forney and Bower [1971]
Gilhousen et al. [1971]) indicates that % is on the order of 1 to 10, and that long
searches are sufficiently rare that the assumption of a nearly empty buffer at the
beginning of each search is reasonably accurate. Then, for a sufficiently low
overflow probability per node, which must be the case for efficient operation, we
would have
Pr {overflow in an #-branch trellis} ~ 44 (uB)-’ (6.6.10)
where p is related parametrically to R by (6.6.8).
Since overflow is almost certainly “catastrophic,” it appears** that one way to
operate a sequential decoder, with finite buffer size and speed factor, is to block off
the data in Y-branch (Yb bit) blocks and to insert, between successive blocks, tails
consisting of (K — 1) branches each containing b zeros. In this way, even if cata-
strophic overflow occurs in one ¥-branch block, the tail allows us to reset the
decoder to the correct state and recommence decoding with a loss of at most #b
bits. Of course, the insertion of tails introduces a reduction of rate by (K — 1)/Y,
and complicates the timing of the decoder. Thus, to keep the degradation small,
K/¥ should be kept small. At the same time, Y cannot be made excessively large
because it appears as a multiplicative factor in block overflow probability (6.6.10).
Typical values used in sequential decoders are Y ~ 500 to 2000, K ~ 20 to 40, and
buffer size in branches? B ~ 10* to 10°. The speed factor u depends, of course, on
the data rate in bits per second, and on the speed and complexity of the digital
logic required for the computations. For example, if we are limited, by a maximum
logic speed, to 10’ branch computations/s, and have a data rate of 10° bits/s, then
pt = 10. Clearly, we must have py > 1 just to keep up with the arriving data. Thus,
for low enough data rates (less than 100 K bits/s), ~B products in excess of 10’ are
possible. Of course, 4 also depends on the complexity of the metric calculation.
Obviously, computation of the Hamming distance metric for the BSC is far sim-
pler than metric computation for an octal output channel; thus, y will be several
times greater for the BSC.
2° Experimental evidence, (Forney and Bower [1971], Gilhousen et al. [1971]) indicates that this
behavior is accurate even for time-invariant linear convolutional codes.
7! There is, however, another strategy which has been implemented effectively with systematic
codes and even some nonsystematic codes. As soon as an overflow occurs, the strategy is to guess the
correct state of the code at this time and start to decode at this point. The most likely state corresponds
to the last (K — 1)b bits (which are transmitted uncoded in a systematic code). If the guess is wrong, the
decoder will again overflow and then the state is again guessed at that time. After several false starts,
the decoder ultimately “synchronizes.”
22 As noted above, this translates into a memory size of Bn[log, J] bits. Thus if we have 2 x 10° bits
available, for a BSC with r = 3(n = 2), this translates to B = 10° branches of buffer storage, while for
an octal output channel with r = } this translates to B = 2.2 x 10* branches.
378 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
We see by comparing (6.6.9) with (6.6.2) that, in sequential decoding, wB plays
a role similar to 2*? in Viterbi decoding. Of course, as we noted, 1B may be of the
order of 10’ at low data rates, while, since 2“~ '” is the number of path memory-
and-metric storage registers in maximum likelihood decoding, it is not feasible to
make this much greater than 10°. On the other hand, at high data rates (107 bit/s
or above), it is not practical to make yp sufficiently greater than unity, as required
for effective sequential decoding, except for a binary-output channel, and even
then this requires very fast digital logic. With Viterbi decoding, however, by
providing a separate metric calculator for each state, we require only that yp = 1.
Also, the highly repetitive “pipeline” nature of a Viterbi decoder, as described in
Chap. 4, serves to reduce its hardware complexity.
Another aspect to be considered is the decoding delay. With sequential decod-
ing, this must be Bb bits; for, in order for the data to be output in the same order
that it was encoded, the same (maximum) delay must be provided for all bits. For
Viterbi decoding, we found in Sec. 5.6 that the maximum delay need only be on
the order of a small multiple of Kb; on the other hand, B is typically two orders of
magnitude greater than K.
The final major consideration which must be included in any choice between
Viterbi decoding and sequential decoding concerns their relative sensitivity to
channel parameter variations, i.e., their robustness. In this category, the sequential
decoder is inferior, for its performance is strongly influenced by the choice of
metric, which depends on the channel parameters (e.g., on the channel error pro-
bability for a BSC, and on the energy-to-noise ratio for an AWGN channel).
Another source of channel variation is the phase tracking inaccuracy (see Sec. 2.5).
In fact, it has been demonstrated that both phase and gain (amplitude) variations
affect sequential decoders more detrimentally than they do Viterbi decoders
(Heller and Jacobs [1971], Gilhousen et al. [1971]). A revealing indication of the
robustness of Viterbi decoders is that, in some cases, the decoder partially offsets
imperfections of the demodulator which precedes it (Jacobs [1974]).
6.7 BIBLIOGRAPHICAL NOTES AND REFERENCES
As was noted previously, the original sequential decoding algorithm was proposed
and analyzed by Wozencraft [1957]. The Fano algorithm [1963], with various
minor modifications, has been analyzed by Yudkin [1964], Wozencraft and Jacobs
[1965], Gallager [1968], and Jelinek [1968a]. Two versions of stack algorithms and
their performance analyses are due to Zigangirov [1966] and Jelinek [1969a]. The
precise form of the Pareto distribution on computation emerged from the works
of Savage [1966] for the upper bound, and of Jacobs and Berlekamp [1967]
for the lower bound.
The development of Secs. 6.2 through 6.4 follows closely the tutorial presenta-
tion of Forney [1974].
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 379
PROBLEMS
6.1 Apply the Hdlder inequality
>
to (6.2.11) through (6.2.13) to verify inequalities (6.2.15) through (6.2.17).
6.2 Suppose the metric defined in (6.1.2) is modified to utilize an arbitrary bias f. That is,
Bea tl ‘
et | P(Yn) .
(a) Show that the correct path mean is positive and the incorrect mean negative, provided B < C.
(b) Show that this modifies (6.2.11) through (6.2.13) by replacing R by f, and similarly for
(6.2.15), but (6.2.16) now involves both R and 8, while (6.2.17) only involves R.
(c) Find the effect on (6.2.19) and Theorem 6.2.1 of using B = R,(1 — €) (choose ap = 4) and thus
show that (6.2.22) is replaced by
Pr {C>L}<AL®/®0-9 = Q<R<R,(1—6)
Find the effect on error probability (6.3.12) and (6.3.13) and thus show that, for low rates, (6.3.12) is
replaced by
P< po ematee eR < RFs)
6.3 If Pr {C > L} = # L ° where R = E,(p)/p, 0 < p< ©
(a) Find the rate R" below which the mean E(C) is finite.
(b) Find the rate R'?’ below which the second moment E(C7”) is finite.
(c) Find the rate R“ below which the kth moment E(C*) is finite.
6.4 (Semisequential Decoding) Consider a constraint length K, rate b/n, convolutional code of B
branches terminated by K — 1 zeros. Suppose we station a genie at the end of the terminated code, and
that we utilize a maximum likelihood decoder of a code of shorter constraint length k < K. Precisely,
let the decoder and genie operate as follows:
1. Suppose k = 1 and decode on this basis the entire B-branch code. If all the right decisions are made,
the genie accepts the result; otherwise, he sends us back to the beginning and step 2.
2. Suppose k = 2 and repeat. Again the genie either accepts the result or sends us back to step 3.
3. Repeat for k = 3 and continue until either the genie accepts or k = K.
Using the results of Chap. 5, show that the probability that the number of computations per
branch exceeds 2* is upper-bounded by
(2° 156 leet ate
1 — 2-5 EAR)/R
Mite 25 <2 O0<p<1
and hence, letting L = 2°*
mi > bei: Bea ed
where R = E,(p)/p and D’ is a constant independent of L.
6.5 (a) Show the relationship between
where /.,;, is given by (6.4.8) and k,,;,
(b) Justify this result intuitively.
is given by (5.5.5).
380 CONVOLUTIONAL CODING AND DIGITAL COMMUNICATION
6.6 Consider a binary tree of depth L + T where two branches diverge from each node at depth less
than L and only one branch emanates from each node from depth L to the final terminal node at depth
L + T. Assume each branch of the tree has n channel symbols independently selected according to
distribution q(x), x € 2%. Suppose one path in the tree is the actual transmitted sequence and let E,,
1 <i <L, be the event that some path to the terminal node diverging from the correct path at node
depth (L — i) is incorrectly decoded. The probability that an incorrect path is decoded when using
maximum likelihood decoding is therefore bounded by
P< ¥ P(E)
For any DMC with input alphabet % show that for P,, averaged over an ensemble of such tree codes
defined by q(x), x € 2, we have
gg (T + 1)nE,(p, q)
Pr< 1 — 2?e~ "Eo(p, a)
where
l+p
E.a)=—mY (Lalso|syerr) "and <p <i
y x
Note that this bound is independent of L. Show that, for any rate r = 1/n bits per channel symbol, less
than capacity, the bound can be made to decrease exponentially with T. Generalize this result to rate
r = b/n bits per channel symbol using a 2°-ary tree of depth L + T (Massey [1974]).
6.7 (The Fano Metric, Massey [1974]) Assume a variable length code {x,, x2, ..., X,,} Where codeword
x,, has a priori probability z,, and length N,,. To each codeword, add a random tail sequence to extend
the codewords to length N = max,, N,,. That is, for codeword x,,, add the tail t,, = (t,, t2,---, ty—n,)
where t,, is randomly chosen according to probability distribution
N—-Nm
Qn-N,(tm) = I] q(t,)
By adding independent random tails to each codeword, a code {z,, Z,, ..., Z,} of fixed block length N,
where z = (x,,, t,,) for each m, is obtained.
(a) Suppose that code {z,, Z,, ..., Z,} is used over a DMC where the decoder does not know
which random tails are used (only their probability distribution). Show that the minimum-probability-
of-error decision rule is to choose m that maximizes L(m, y) for channel output sequence y = (y,, y>,
aij By where
Nm
a PYn|%mn) , 1
Ae We gor ins wie
and
P(Vn) = > P(Val x)a(x)
where p(y|x) is the transition probability of the DMC.
(b) In a sequential decoder, suppose {x,, X,, ..., X,,{ above represents all the paths in the
encoding tree that have been explored up to the present time. The decoder is assumed to know nothing
about the symbols in the unexplored part of the encoded tree except that they are selected indepen-
dently according to q( - ). In order for the decoder to learn the branch symbols that extend any already
explored path, it must pay the price of one computation. (Any sequential decoding algorithm can be
thought of as a rule for deciding which already-explored path to extend.) Show that, when the informa-
tion bits are independent and equally likely, then L(m, y) is the Fano metric given by (6.1.1) and (6.1.2).
Hence the basic stack algorithm always extends the path which is chosen according to the minimum
probability of error criterion.
SEQUENTIAL DECODING OF CONVOLUTIONAL CODES 381
6.8 (Massey [1973]) Suppose we have an r = 4 binary tree code used over the BEC with erasure
probability p. Using the stack decoding algorithm, we note that any path that disagrees with the
received sequence in any unerased position has metric — 00 and can never reach the top of the stack.
Over the ensemble of tree codes and received sequences, we now find bounds on the average computa-
tion per node, of
Following the notation of Sec. 6.2, define the random variable
| 1 path x‘(k) € 2”(j) is extended by the algorithm
e(x, x(k), y) =
: \o otherwise
and thus
@ 2k-s-1
C= YY ex, xjilk), y)
k=j+1 i=1
where x’,(k) is the ith path in 2”(j) at node depth k. (There are 2*~/~* such paths.)
(a) Show that, over the ensemble of tree codes and received sequences,
Pr {e(x, x/,(k), y) = 1} < Pr {path x/,(k) agrees with y in all unerased positions}
ae ae oP
=2-28-Ire Fy = —log (1 - 44)
Then show that
fae 2ro
ene ee ; cee |
‘ cor provided r=5<ry
(b) Next observe that, whenever path xj,(k) reaches the top of the stack before x(k), then
e(x, xj;(k), y) = 1. Show that
Pr {e(x, xj;(k), y) = 1} > 3(2- 7“)
and thus
provided r=4<r,
ree, tn ae Per i
Pa oes: Peer
Bae 2
‘
fe
-
Z
a a - ae
Yori
9:
PART
THREE
SOURCE CODING FOR
DIGITAL COMMUNICATION
ea
eae
pa
ca
=
Ae
my
{
~
ne
: e
4
a
vn
*
Pu.
i
CHAPTER
SEVEN
RATE DISTORTION THEORY:
FUNDAMENTAL CONCEPTS
FOR MEMORYLESS SOURCES
7.1 THE SOURCE CODING PROBLEM
Rate distortion theory is the fundamental theory of data compression. It estab-
lishes the theoretical minimum average number of binary digits per source symbol
(or per unit time), i.e., the rate, required to represent a source so that it can be
reconstructed to satisfy a given fidelity criterion, one within the allowed distortion.
Although the foundations were laid by Shannon in 1948, it was not until 1959 that
Shannon fully developed this theory when he established the fundamental
theorems for the rate distortion function of a source with respect to a fidelity
criterion which endow this function with its operational significance. Initially, rate
distortion theory did not receive as much attention as the better known channel
coding theory treated in Chaps. 2 through 6. Ultimately, however, interest grew in
expanding this theory and in the insights it affords into data compression practice.
Let us now re-examine the general basic block diagram of a communication
system depicted in Fig. 7.1. As always we assume that we have no control over the
source, channel, and user.' We are free to construct only the encoders and de-
coders. In Chap. 1 we determined the minimum number of binary symbols per
source symbol such that the original source sequence can be perfectly reconstructed
by observing the binary sequence. There we found that Shannon’s noiseless coding
1 In earlier chapters we referred to the user as the destination. To emphasize the active role of the
user of information in determining the fidelity measure, we now call the final destination point the user.
385
386 SOURCE CODING FOR DIGITAL COMMUNICATION
S Sees cereal
| |
| |
| |
Source .. Source ae Channel |
{ encoder encoder ‘
|
|
: Encoder 3 ,
Channel
a Se eee _
|
| |
: |
Fae l Source Channel | _ |
decoder decoder | — \
| |
| Decoder |
prostate k Snceah ecient caw 5 oe pee ew eat ee =J
Figure 7.1 Communication system model.
theorem gave operational significance to the entropy function of a source. In this
chapter we generalize the theory of noiseless source coding in Chap. 1 by defining a
distortion measure and examining the problem of representing source symbols
within a given fidelity criterion. We shall examine the tradeoff between the rate of
information needed to represent the source symbols and the fidelity with which
source symbols can be reconstructed from this information.
Chapters 2 through 6 were devoted to the channel coding problem where we
restricted our attention to only the part of the block diagram of Fig. 7.1 consisting
of the channel encoder, channel, and channel decoder. In these chapters, we
showed that channel encoders and decoders can be found which ensure an arbi-
trarily small error probability for messages transmitted through the channel
encoder, channel, and channel decoder as long as the message rate.is less than the
channel capacity. For the development of rate distortion theory, we assume that
ideal channel encoders and decoders are employed so that the link between the
source encoder and source decoder is noiseless as shown in Fig. 7.2.7 This requires
the assumption that the rate on this link is less than the channel capacity.
The assumption that source and channel encoders can be considered
separately will be justified on the basis that, in the limit of arbitrarily complex
overall encoders and decoders, no loss in performance results from separating
source and channel coding in this way. Representing the source output by a
sequence of binary digits also isolates the problem of source representation from
that of information transmission. From a practical viewpoint, this separation is
desirable since it allows channel encoders and decoders to be designed inde-
pendently of the actual source and user. The source encoder and source decoder
in effect adapt the source and user to the channel coding system.
2 This is also a natural model for storage of data in a computer. In this case the capacity of the
noiseless channel represents the limited amount of memory allowed per source symbol.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 387
Fee ee.
u Source Noiseless ! Source v
rce aie —E— tn User
Source encoder channel decoder
ae
Figure 7.2 Source coding model.
We begin by defining a source alphabet %, a user alphabet Y (sometimes
called the representation alphabet), a distortion measure d(u, v) for each pair of
symbols in @ and ¥, and a statistical characterization of the source. With these
definitions and assumptions, we can begin our discussion of rate distortion theory.
For this chapter we will consider discrete-time memoryless sources that emit a
symbol u belonging to alphabet W each unit of time, say every T; seconds. Here the
user alphabet ¥ depends on the user, although in many cases it is the same as the
source alphabet. Throughout this chapter we will also assume a single-letter dis-
tortion measure between any source symbol u and any user symbol v represented
by d(u, v) and satisfying
d(u, v) >0 (7.1.1)
This is sometimes referred to as a context-free distortion measure since it does not
depend on the other terms in the sequence of source and user symbols.
Referring to Fig. 7.2 we now consider the problem of source encoding and
decoding so as to achieve an average distortion no greater than D. Suppose we
consider all possible encoder—decoder pairs that achieve average distortion D or
less and denote by &> the set of rates required by these encoder—decoder pairs.
By rate R, we mean the average number of nats per source symbol? transmitted
over the link between source encoder and source decoder in Fig. 7.2. We now
define the rate distortion function for a given D as the minimum possible rate R
necessary to achieve average distortion D or less. Formally, we define*
R*(D)= min R___esanats/source symbol (7.1.2)
Re&p
Naturally this function depends on the particular source statistics and the distor-
tion measure. This direct definition of the rate distortion function does not allow
us actually to evaluate R*(D) for various values of D. However, we shall see that
this definition is meaningful for all stationary ergodic discrete-time sources with a
single-letter distortion measure, and for these cases we will show that R*(D) can
be expressed in terms of an average mutual information function, R(D), which will
be derived in Sec. 7.2.
There is another way of looking at this same problem, namely the distortion
rate viewpoint. Suppose we consider all source encoder—decoder pairs that require
fixed rate R and let Dp be the set of all the average distortions of these encoder-
3 Recall that R = r In 2 nats per symbol where r is the rate measured in bits per symbol.
* Strictly speaking, the “minimum” here should be “ infimum.”
388 SOURCE CODING FOR DIGITAL COMMUNICATION
decoder pairs. Then, analogously to the previous definition, we define the distortion
rate function as
D*(R) = min D (7.1.3)
De QDR
For stationary ergodic sources with single-letter distortion measures, the
definitions of R*(D) and D*(R) yield equivalent results, the only difference being
the choice of dependent and independent variables.
The study of rate distortion theory can be divided roughly into three areas.
First, for each kind of source and distortion measure, one must find an explicit
function R(D) and prove coding theorems which show that it is possible to achieve
an average distortion of D or less with an encoding and decoding scheme of rate R
for any rate R > R(D). A converse must also be derived which shows that if an
encoder-—decoder pair has rate R < R(D), then it is impossible to achieve average
distortion of D or less with this pair. These two theorems (direct and converse)
establish that R*(D) = R(D) and give operational significance to the function
R(D). The second area concerns the actual determination of the optimal attainable
performance, and this requires finding the form of the rate distortion function,
R*(D), for various sources and distortion measures. Often when this is difficult,
tight bounds on R*(D) can be obtained. The final category of study deals with the
application of rate distortion theory to data compression practice. Developing
effective sets of implementation techniques for source encoding which produces
rates approaching R*(D), finding meaningful measures of distortion that agree
well with users’ needs, and finding reasonable statistical models for important
sources are the three main problems associated with application of this theory to
practice.
In this chapter, we develop the basic theory for memoryless sources, beginning
with block codes for discrete memoryless sources in the next section, and its
relationship to channel coding theory in Sec. 7.3. Results on tree codes and trellis
codes are presented in Sec. 7.4. All these results are extended to continuous-
amplitude (discrete-time) memoryless sources in Sec. 7.5. Sections 7.6 and 7.7
treat the evaluation of the rate distortion function for discrete memoryless sources
and continuous-amplitude memoryless sources, respectively. Various generaliza-
tions of the theory are presented in Chap. 8, including sources with memory and
universal coding concepts.
7.2 DISCRETE MEMORYLESS SOURCES—BLOCK CODES
In this section and the following two sections we shall restrict our study of source
coding with a fidelity criterion to the case of a discrete memoryless source with
alphabet W = {a,, a,,..., a4} and letter probabilities Q(a,), Q(a>), ..., Q(a4).
Then in each unit of time, say T, seconds, the source emits a symbol ue W
according to these probabilities and independent of past or future outputs. The
user alphabet is denoted ¥ = {b,, b,, ..., bg} and there is a nonnegative distor-
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 389
tion measure d(u, v) defined for each pair (u, v) in @ x ¥. Since the alphabet is
finite, we may assume that there exists a finite number dy such that for all ue W
and vev
0 <d(u, v) <d) < @ (7.2.1)
In this section, we consider block source coding where sequences of N source
symbols will be represented by sequences of N user symbols. The average amount
of distortion between N source output symbols u = (u;, uz, ..., uy) and N rep-
resentation symbols v = (v;, v2, ..., Uy) is given by
1 N
d = — P22
y(u, ¥) N as ( )
Let B = {V¥1, V2, ..., Vy} be aset of M representation sequences of N user symbols
each. This is called a block source code of size M and block length N, and each
sequence in & is called a codeword. Code # will be used to encode a source
sequence u € W, by choosing the codeword v € # which minimizes d,(u, v). We
denote this minimum by
d(u|#) = min dy(u, v) (7.2.3)
veB
and we define in a natural way the average distortion achieved with code Z as
=Y Q,(u)d(u| Z) (7.2.4)
where
N
= [| Qu,) (7.235)
n=1
as follows from the assumption that the source is memoryless.
Each N units of time when N source symbols are collected by the source
encoder, the encoder selects a codeword according to the minimum distortion rule
(7.2.3). The index of the selected codeword is then transmitted over the link
between source encoder and source decoder. The source decoder then selects the
codeword with this transmitted index and presents it to the user. This block
source coding system is shown in Fig. 7.3. Since, for each sequence of N source
symbols, one of M indices is transmitted over the noiseless channel between the
encoder and decoder (which can be represented by a distinct binary sequence
whose length is the smallest integer greater than or equal to log M) the required
Search for
_ codeword v,, in m Choose vi
os >| @whi ms SE —>| code word —=—e4 User
whic ini- :
Uy mizes dy (u, v) meth; 2.) 22M} Vin €B Vy
n UW,
Figure 7.3 Block source coding system.
390 SOURCE CODING FOR DIGITAL COMMUNICATION
rate’ is R = (In M)/N nats per source symbol. In the following we will refer to
code ¥ as a block code of block length N and rate R.
For a given fidelity criterion D, we are interested in determining how small a
rate R can be achieved when d(&) < D. Unfortunately, for any given code &%, the
average distortion d(Z) is generally difficult to evaluate. Indeed, the evaluation of
d(Z) is analogous to the evaluation of error probabilities for specific codes in
channel coding. Just as we did in channel coding, we now use ensemble average
coding arguments to get around this difficulty and show how well the above block
source coding system can perform. Thus we proceed to prove coding theorems
that establish the theoretically minimum possible rate R for a given distortion D.
Let us first introduce an arbitrary conditional probability distribution
P(vju):vEeV, ue U}.° For sequences ue Wy and v € Wy, we assume condi-
tional independence in this distribution so that
Py(v|u) = TP P(v,|u,) (7.2.6)
Corresponding marginal probabilities are thus given by
P,(v) = d Py(v|u)Qy(u)
bs P(v,) (7.2.7)
n=1
=) P(r|uje
Similarly, applying Bayes’ rule, we have the backward conditional probabilities
Py(v |u)Qy(u)
where
y(u |v) | Py(v)
ae [I Q(u,,| v,) (7.2.8)
where ee
ul») = Fee)
We attach no physical significance to the conditional probabilities
P(vju):ve Vv, ue W}, but merely use them as a convenient tool for deriving
bounds on the average distortion when using a code & of size M and block length
N.
° M is usually taken to be a power of 2; however, even if this is not the case, we may combine the
transmission of several indices into one larger channel codeword and thus approach R as closely as
desired.
© We shall denote all probability distribution and density functions associated with source coding
by capital letters.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 391
Recall from (7.2.4) that the average disiortion achieved using code Z is
d(Z) = ¥) Qv(u)d(u|Z) (7.2.4)
Since
ees
we can also write this as
(4) = FY Os(u)Py(v |u)alu| A) (7.2.9)
Here v € ¥ y is not a codeword but only a dummy variable of summation. We
now split the summation over u and vy into two disjoint regions by defining the
indicator function
_ fl dy(u, v) < d(u |Z)
O(u, v; J 7.2.10
A= Wy atu, v) > d(ul a) oY
Since (1 — ®)+ ® = 1, we have
d(B) = > > On(u)Px(v|u)d(u| A)[1 — Olu, v; A)]
+ YY Qy(u)Pr(¥|u)d(u| A)O(u, v; A) (7.2.11)
Using the inequality, which results from definition (7.2.10)
d(u| A#)[1 — O(u, v; B)] < dy(u, v) (7.2.12)
in the first summation and using the inequality, which follows from (7.2.1)
d(u| #) = min dy(u, v) < do (7.2.13)
veB
in the second summation in (7.2.11), we obtain the bound
dB) < YY On(u)Po(v|u)dy(uy v) + do YY Qy(u)Pa(v[u)O(u, vs A) (7.2.14)
The first term in this bound simplifies to
YY On(wpPr(e lu) dy(u, ») = YY Qn(wyPatv lu) yD Alte
= D(P) (7.2.15)
To bound the second term, we need to apply an ensemble average argument.
In particular, we consider an ensemble of block codes of size M and block length
392 SOURCE CODING FOR DIGITAL COMMUNICATION
N where & = {v,, V2, ..., Vay} is assigned the product measure
P(B) = T] Pr (Vm) (7.2.16)
where Py(v) is defined by (7.2.7) and is the marginal distribution corresponding to
the given conditional probability distribution {P(v|u): ve V,ue W}. Averages
over this code ensemble will be denoted by an upper bar (- ). The desired bound
for the ensemble average of the second term in (7.2.14) is given by the following
lemma.
Lemma 7.2.1
YY Qy(u)Py(v |u)@(u, v; B) < e NPR PP (7.2.17)
where
E(R; p, P) = —pR + E,(p, P) (7.2.18)
1+p
E,(p, P) = —In ¥ 1d. P(v)Q(ulv)'/CT” —1<p<0
In M
ee
Proor Using the Holder inequality (see App. 3A), we have, for any
dd Qn(u)Pr(v|u)P(u, v; B)
=)» Pr(v)Qn(u|v)P(u, v; B)
me
< 3 sy Py(v)Oy(u |v)! id » Py(v)®(u, Vv; B) (7.2.19)
since it follows from definition (7.2.10) that ®~'/’ = ®. Averaging this over
the code ensemble and applying the Jensen inequality over the same range of
p yields
2 2 Qn(u)Py(v |u)®(u, v; B)
j1+0] a
<)) ID Palv)Qx(ulv)"O*” » Prx(v)®(u, v; ul
< 2 d Palatal yy” oe y> Py(v)P(u, v; Z) ish (7.2.20)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 393
The second bracketed term above is simply
> Pr(v)P(u, v; 4)= > > 2, Pulv) (v)Py(v1)°°* Py(V¥s)P(u, v; A)
vi 4
= Pr {dy(u, v) < min (dy(u, v,), dy(u, v2), ..., dy(u, Vay))}
1
M+1
= 7.2.21
since the code Z has the product measure given in (7.2.16) and thus, for a fixed
u, each of the random variables dy(u, v), dy(u, v;), ..., dy(u, ¥y7), which are
independent and identically distributed, has the same probability of being the
minimum. Using (7.2.21) in (7.2.20), we have
dd Qn(u)Py(v|u)O(u, v; 4) <M? > rE Px(v)Qx(u eee
= ps ldedl Plen\QCu 2)"
u v n=1
N Jitp
os eNeR 2 TT > P( (v)Q(u n|v) eee,
u _[n=1 v
N Jit+p
— gNeR I] » P(v)Q(u n| 0) ji +)
n=1 v
= eNPRT] Y | Pejatujayrrn |”
r
()Qujoyerr]"l
= @7 NEAR: p.P) (7.2.22)
Let us briefly examine the behavior of this bound for various parameter
values. As stated in the above lemma, the bound given in (7.2.17) applies for all p
in the range —1<p<0O and for any choice of the conditional probability
{P(v|u):veV, ue W. The expression E(R; p, P) is identical to the random
coding exponent in channel coding theory introduced in Sec. 3.1. The only differ-
ence is that here the parameter p ranges between —1 and 0 while for channel
coding this parameter ranges from 0 to 1. Also, here we can pick an arbitrary
conditional probability {P(v|u)} which influences both P(v) and Q(u|v), while in
the channel random coding exponent the channel conditional probability is fixed
and only the distribution of the code ensemble is allowed to change. In the
following lemmas, we draw upon our earlier examination of the random coding
bound for channel coding. Here E,(p, P) is a form of the Gallager function first
defined in (3.1.18).
394 SOURCE CODING FOR DIGITAL COMMUNICATION
Lemma 7.2.2
1+ p
E,(p, P) = —In Y |Y P(w)Q(u|o)*” (7.2.23)
has the following properties for —1 <p <0:
E,(p, P) <0
CE Ate Es j (P) >0 (7.2.24)
Op
2
O°E(p, P) _ 4
dp?
E,(0, P) = 0
6E,(p, P)
E,(0, P) = == ViP
ONR . e)
where’
r
1(P) =. Q(u)P(v|u) In vo (7.2.25)
is the usual average mutual information function.
PRrooF This lemma is the same as Lemma 3.2.1. Its proof is given in App. 3A.
Since we are free to choose any p in the interval —1 < p <0, the bound in
Lemma 7.2.1 can be minimized with respect to p or, equivalently, the negative
exponent can be maximized. We first establish that the minimum.always corre-
sponds to a negative exponent, and then show how to determine its value.
Lemma 7.2.3
max E(R;p,P)>0 # for R>I(P) (7.2.26)
hk SA
ProoF It follows from the properties given in Lemma 7.2.2 and the mean
value theorem that, for any 6 > 0, there exists a p, in the interval —1 < p, <0
such that®
et E,(0, P) a E (po; P)
E,(0, P) Ee
< E,(0, P) + 6
7 1(P) = 1(%; ¥) was first defined in Sec. 1.2. Henceforth, the conditional probability distribution
is used as the argument because this is the variable over which we optimize.
8 We assume E,(p, P) is strictly convex © in p. Otherwise this proof is trivial.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 395
which, since E,(0, P) = 0 and E/(0, P) = 1(P), implies
E,(p., P) = pol I(P) + 9] (7.2.27)
Hence
max E(R;p,P)= max [—pR+E,(p, P)|
-1<p<0 -1<p<0
ee else)
> —p.R+ p[l(P) + 4]
= pik sia) — 0]
We can choose 6 = [R — I(P)|/2 > 0 so that
max E(R; p, P)> =9.( SS )) >0 (7.2.28)
ol Spee 2
Analogously to the channel coding bound for fixed conditional probability
distribution {P(v|u): ve V7, ue WY}, the value of the exponent
max E(R;p, P)
=i<9<0
is determined by the parametric equation
max E(R; p, P)= —p*R+ E,(p*, P)
—i<ps0
R = Edo, P) (7.2.29)
Cp |p=os
for
I(P) < R = 6E,(p, P)
dp p=-1
and —1 < p* <0. In Fig. 7.4 we sketch these relationships.
Now let us combine these results into a bound on the average distortion using
codes of block length N and rate R. We take the code ensemble average of d(#)
given by (7.2.14) and bound this by the sum of (7.2.15) and the bound in Lemma
7.2.1. This results in the bound on d(Z) given by
d(@) < D(P) + doe NER?) (7.2.30)
for any —1 < p <0. Minimizing the bound with respect to p yields
d(Z) < D(P) + dy ed N| max E(R; p, P) | (7.2.31)
a =i Ss p<0
where
max E(R;p,P)>0 for R>I(P)
=i Sps0
396 SOURCE CODING FOR DIGITAL COMMUNICATION
Slope =R
Slope = /(P)
p*
| p
—| | ]
|
|
|
|
ee a oe ee ee ee ee eee Eo (p*, P)
CARRE, Sette at reagl SACI p*R
Figure 7.4 Eo(p, P) curve.
and
D(P) =) d, Qlu)P(v|u)d(u, v)
At this point we are free to choose the conditional probability {P(v|u)} to mini-
mize the bound on d(&) further. Suppose we are given a fidelity criterion D which
we wish to satisfy with the block source encoder and decoder system of Fig. 7.3.
Let us next define the set of conditional probabilities that satisfy the condition
D(P) <D
Py = {P(v|u): D(P) < D} (7.2.32)
It follows that Pp is a nonempty, closed, convex set for all
D>=Y¥ Q(u) min d(u, v) = Dain (8235)
u vev
since in defining v(u) by the relation d(u, v(u)) = min d(u, v) we may construct the
conditional distribution
Prin(v |u) = (7.2.34)
0 v # v(u)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 397
which belongs to Y, and achieves the lower bound. Now we define the source
reliability function
E(R, D)= max max E(R; p, P) (7.2.35)
Pe Fp —1<p<90
and the function
R(D) = min I(P) (7.2.36)
Pe Pp
We will soon show that in fact R(D) is the rate distortion function as defined in
(7.1.2), but for the moment we shall treat it only as a candidate for the rate
distortion function. With these definitions we have the source coding theorem.
Theorem 7.2.1: Source coding theorem For any block length N and rate R,
there exists a block code # with average distortion d(Z) satisfying
d(Z) < D+ dye NFR 237)
where
E(R,D)>0 for R> R(D)
ProoF Suppose P* € Yp achieves the maximization (7.2.35) in the source
reliability function. Then from (7.2.31) we have
d(Z) < D(P*) + dye NF®: P) (7.2.38)
where
E(R,D)>0 = for R > 1(P*)
But by definition (7.2.32) of Ap, we have D(P*) < D. Also since
E(R, D)> max E(R;p,P)>0 forR>I(P)
—-il<p=0
where P can be any Pe #p, we have
E(R,D)>0 for R> min I(P) = R(D)
Pe Pp
Hence
d(B) < D + dye NER) (7.2.39)
where E(R, D) > 0 for R > R(D). Since this bound holds for the ensemble
average over all codes of block length N and rate R, we know that there exists
at least one code whose distortion is less than or equal to d(#), thus complet-
ing the proof.
Example (Binary symmetric source, error distortion) Let ¥ = Y = {0, 1} and d(u, v) = 1 — 6,,.
Also suppose Q(0) = Q(1) = 4. By symmetry, the distribution P ¢ #, that achieves both E(R, D)
and R(D) is given by
a |D v#xzu
i ta Saas ou
where O0<D<} (7.2.40)
398 SOURCE CODING FOR DIGITAL COMMUNICATION
Then the parametric equations (7.2.29) become (see also Sec. 3.4)
E(R, D) = E(R; p*, P) = —6, In D — (1 — 6p) In (1 — D) — #(6,) (7.2.41)
and
R=In2—- #(6p) (7.2.42)
where
plate)
dp —l1<p<0 (7.2.43)
~ puate) 4 (y — plate
E(R, D) is sketched in Fig. 7.5 for 0 < D <4 and R(D) < R < In 2 where R(D) = In 2 — #(D).
E(K,D)
0.1
R=In2-KxXWM)
Figure 7.5 Sketch of E(R, D) for the binary symmetric source with error distribution.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 399
Theorem 7.2.1 shows that, as block length increases, we can find a code of any
rate R > R(D) whose average distortion is arbitrarily close to D. A weaker but
more common form of this theorem is given next.
Corollary 7.2.2 Given any «> 0, there exists a block code & of rate
R < R(D) + « with average distortion d(#) < D+.
PRooF Let R satisfy
R(D)<R<R(D)+.e
and choose N large enough so that
de NEO: Oe
In order to show that R(D) is indeed the rate distortion function, we must
show that it is impossible to achieve an average distortion of D or less with any
source encoder—decoder pair that has rate R < R(D). To show this we first need
two properties of I(P). First let {P,(v|u):v ¢ Wy,u € Wy} be any arbitrary condi-
tional distribution on sequences of length N. Also let P(v,|u,) be the marginal
conditional distribution for the nth pair (v,, u,) derived from this distribution.
Defining
(Ps) = Y Oa(u)Pa(v|u) in “201 (7.2.44)
and
1(P) = YY Olw)P(o|u) In = aera (7.2.45)
where
Ov(u) = [] Qu)
and
P®(v) =), O(u)P™(v|u)
we have the following inequalities
ee | ae
Ni— > P@}<— V1 7.2.46
iir)<hiuey oe
and
< FP) < + 1(Py) (7.2.47)
400 SOURCE CODING FOR DIGITAL COMMUNICATION
Inequality (7.2.46) is the statement that I(P) is a convex U function of P. This
statement is given in Lemma 1A.2 in App. 1A. Inequality (7.2.47) can be shown
using an argument analogous to the proof of Lemma 1.2.2 for I(%y; Yy) given in
Chap. 1 (see Prob. 7.1).
Theorem 7.2.3: Converse source coding theorem For any source encoder-
decoder pair it is impossible to achieve average distortion less than or equal to
D whenever the rate R satisfies R < R(D).
Proor Any encoder—decoder pair defines a mapping from source sequences
to user sequences. For any length N, consider the mapping from W, to Wy
where we let M be the number of distinct sequences in Wy into which the
sequences of Wy, are mapped. Define the conditional distribution
1 if u is mapped into v
P =
nv |) 0 otherwise
(7.2.48)
and let P”(v|u) be the resulting marginal conditional distribution on the nth
terms in the sequences. Also, define the conditional distribution
N
P(v|u) = < 2 PM | u) (7.2.49)
Now let us assume that the mapping results in an average distortion of D
or less. Then
D(Px) = YY. On(u)Py(v | u)dy(u |v)
=) Qv(u)dy(u, v(u))
<D (7.2.50)
where u is mapped into v(u). But by definition (7.2.2)
eM@!
<M
tO
2
—_~
i=
—
a
2
—_
<
=
Pieaeas
Qu
—
=
=
S
=
ae
_
i—~4jz Me
_
eM
lead
eo)
—_
=
—
v
=
_~
eS
=
—
Qu
_—
=
S¢
—
<D (7.2.51)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 401
where the inequality follows from (7.2.50). Hence P(v|u) given by (7.2.49)
belongs to Pp and so
(7.2.52)
We used here inequalities (7.2.46), (7.2.47), and I(P,) < In M (see Prob. 1.7).?
Hence, D(Py) < D implies that R(D) < R, which proves the theorem.
Note that this converse source coding theorem applies to all source encoder-
decoder pairs and is not limited to block coding. For any encoder—decoder pair
and any sequence of length N, there is some mapping defined from W%, to ¥', and
that is all that is required in the proof. Later in Sec. 7.3 when we consider non-
block codes called trellis codes, this theorem will still be applicable.
The source coding theorem (Theorem 7.2.1) and the converse source coding
theorem (Theorem 7.2.3) together show that R(D) is the rate distortion function.
Hence for discrete memoryless sources we have R*(D) = R(D) where
R(D)= min I(P) __nats/source symbol
KP) = 5 Y Qw)P(e|u) in i ee (7.2.53)
Ce P(e) 2 2 Q(u)P(v|u)d(u, v) < D|
Thus for this case we have an explicit form of the rate distortion function in terms
of a minimization of average mutual information.
The preceding source coding theorem and its converse establish that the rate
distortion function R(D) given by (7.2.53) specifies the minimum rate at which the
source decoder must receive information about the source outputs in order to be
able to represent it to the user with an average distortion that does not exceed D.
° With entropy source coding discussed in Chap. 1 it may be possible to reduce the rate below
(In M)/N, but never below I(P,)/N.
402 SOURCE CODING FOR DIGITAL COMMUNICATION
Theorem 7.2.1 also shows that block codes can achieve distortion D with rate
R(D) in the limit as the block length N goes to infinity. For a block code & of finite
block length N and rate R, it is natural to ask how close to the rate distortion limit
(D, R(D)) we can have (d(#), R). The following theorem provides a bound on the
rate of convergence to the limit (D, R(D)).
Theorem 7.2.4 There exists a code & of block length N and rate R such that
Os ABj—D <d,e S0Nre (7.2.54)
when
0 < 6(N) = R— R(D) <3C,
where C, = 2 + 16[In A]? is a constant such that for all P
0°E,(p, P)
6p?
> -C, —4<p<0
PRooF From (7.2.30) we know that, for each p in the interval — 1 < p < O and
for the conditional probability {P(v|u):v e WV, u € W}, there exists a code Z of
block length N and rate R such that
d(Z) < D(P) + dge NER PP (7.2.55)
Recall from (7.2.18) that
E(R; p, P) = —pR + E,(p, P)
1+p (7.2.56)
E,(p, P) = —In 2 rig P(v)Q(ul vr”
For fixed P, twice integrating E”(p, P) = 07E,(p, P)/dp yields
- oF
| | Be(, P) da dB = —pE,(0, P) + E,(p, P)— E,(0, P) (7.2.57)
Ge 20
Since E,(0, P) = 0 and E/,(0, P) = I(P), we have
p .B
E,(p, P)= pl(P) + | | Ez(a, P) da ap (7.2.58)
0
Let C, be any constant upper bound to — E?(p, P). (See Prob. 7.3, where we
show that C, = 2 + 16[In A]? is such a bound for —4 < p <0.) Then
p .B
E,(p, P) > pl(P)— | | C, dx dp
0. * =O
0?
=pI(P)-—C, -3<p<0 (7.2.59)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 403
Hence
2
E(R; p, P) > —pR + pI(P) — é: (7.2.60)
Now choose P* € Y, such that /(P*) = R(D). Then
p
E(R; p, P*) > —p[R — R(D)] - mn e.. (7.2.61)
Defining 6(N) = R — R(D), we choose
6(N)
+= ——— 7.2.62
p C. (7.2.62)
where 6(N) is assumed small enough to guarantee —4 < p* <0. Then
5°(N)
ER aX; P* id
pt P*)> 3c. (7.2.63)
and putting this into (7.2.55) gives
UB) < D(P*) + d,e te
< D + doe N2 (2Co (7.2.64)
There are many ways in which the bound on (d(#), R) can be made to
converge to (D, R(D)). For example, for some constant « > 0
6(N) = R — R(D) = aN 38 (7.2.65)
yields
2a71/4
0<d(Z)-— D<dyexp iad (7.2.66)
be ace
A different choice of
l
5(N) = R — R(D) =a “= (7.2.67)
yields
0 <d(J do
which shows that, if R + R(D) as ./(In N)/N, we can have d(#)— D as N~’ for
any fixed y > 0.
Although Theorem 7.2.4 does not yield the tightest known bounds on the
convergence of (d(#), R) to (D, R(D)) (cf. Berger [1971], Gallager [1968], Pilc
[1968]), the bounds are easy to evaluate. It turns out that some sources called
symmetric sources have block source coding schemes that can be shown to con-
verge much faster with block length (see Chap. 8, Sec. 8.5).
404 sOURCE CODING FOR DIGITAL COMMUNICATION
7.3 RELATIONSHIPS WITH CHANNEL CODING
There are several relationships between the channel coding theory of Chaps. 2
through 6 and rate distortion theory. Some of these appear simply because the
same mathematical tools are applied in both theories, while others are of a more
fundamental nature.
Suppose we no longer assume a noiseless channel and consider both source
and channel block coding as shown in Fig. 7.6. Assume the discrete memoryless
source emits a symbol once every 7, seconds and that we have a source encoder
and decoder for a source block code, %, of block length N and rate R nats per
symbol of duration T, seconds. For each NT, seconds, a minimum distortion
codeword in # = {v,, V2,..., Vy} is chosen to represent the source sequence
u € Wy, and the codeword index m is sent over the channel. Hence, once every NT,
seconds, one of M = e’® messages is sent over the channel. We assume that the
memoryless channel is used once every T, seconds and has a channel capacity of C
nats per channel use of duration T, seconds. The channel encoder and decoder use
a channel block code, @, of block length N and rate R nats per channel use
where’°®
T.N = T,N
3 (7.3.1)
T.KR=1;,R
‘0 It is not strictly necessary for the channel block length to satisfy (7.3.1) since the channel encoder
can regard sequences of source encoder outputs as channel input symbols; that is, N could be any
multiple of its value as given by (7.3.1).
DMS u Search for me {1,2,...,M} Choose a x
U.T.,Q >| a codeword >| codeword
ae Uy V,€@® |Rnats every T, seconds| X,,€C oan a
R nats every T, seconds
Source encoder Channel encoder 1
Me
~~
Channel source - DMC
Pr {&}=Prim#m} je 9
Vy, Choose a me {1,2,....M) Search for y
User < codeword |e a codeword |
Vy vn e® R nats every T, seconds x7, EC Uy
Source decoder Channel decoder
oy
—~—
Channel destination
Figure 7.6 Combined source and channel coding.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 405
Here let & = {m+ m} be the event that a channel message error occurs and let
& = {m = m} denote its complement. The average distortion attained when using
source code # and channel code @ is
d(B, 6) = E{dy(u, v,)|B,8)}
= E{d,(u, v,,)|4, G, &} Pr{é} + E{dy(u, v,,)|B, €, &} Pr{é} (7.3.2)
where the expectation E{-} is over both source output random variables and
noisy channel outputs. When no channel errors occur, we have
dy(u, V,,) = dy(u, V,,) = d(u|#) = min d,(u, v)
veB
From (7.2.1), we have
E{dy(u,v,,)|4, G, €} < do (7.3.3)
Substituting this bound and
Pr{é} <1 (7.3.4)
in (7.3.2), we have
d(B, @) < ¥ Qn(u)d(u| A) + do Pr{é} (7.3.5)
From channel coding theory (Theorem 3.2.1), we know that there exists a
channel code ¥ such that the probability of a channel message error Pr{é&} is
bounded by
Prigi<e
= g~ (TJ TINE(TR/T.) (7.3.6)
where
~
T.R Pins Me
E(*)>0 when R = = yet
Similarly from Theorem 7.2.1, we know that there exists a source code # such that
d Ox(u)d(u|B) < D+ dye **®” (7.3.7)
where E(R, D) > 0 for R > R(D). Applying these codes to the combined source
and channel coding scheme of Fig. 7.6, substituting (7.3.6) and (7.3.7) in (7.3.5),
yields the average distortion given by the following theorem.
Theorem 7.3.1 For the combined source and channel coding scheme of
Fig. 7.6 discussed above, there exists a source code # of rate R and block
length N and a channel code @ of rate R and block length N satisfying (7.3.1)
such that the average distortion is bounded by
d(Z, €) < D zs 3, D) a} dge” THT EG Ae (7.3.8)
406 SOURCE CODING FOR DIGITAL COMMUNICATION
where
for R satisfying
R(D)<R<C (7.3.9)
where
L ~~
C=— a §
T a (7.3.10)
is the channel capacity in nats per T, seconds.
As long as the rate distortion function is less than the channel capacity,
R(D) <C, we can achieve average distortion arbitrarily close to D. When
R(D) > C, this is impossible, as established by the following.
Theorem 7.3.2 It is impossible to reproduce the source in Theorem 7.3.1 with
fidelity D at the receiving end of any discrete memoryless channel of capacity
C < R(D) nats per source letter.
Proor The proof of this converse follows directly from the data-processing
theorem (Theorem 1.2.1) and the converse source coding theorem (Theorem
7.2.3) (see Prob. 7.5).
The above converse theorem is true regardless of what type of encoders and
decoders are used. In fact, they need not be separated as shown in Fig. 7.6, nor do
they need to be block coding schemes for Theorem 7.3.2 to be true. Since
Theorem 7.3.1 is true for the block source and channel coding scheme of Fig. 7.6,
we see that in the limit of large block lengths there is no loss of generality in
assuming a complete separation of source coding and channel coding. From a
practical viewpoint, this separation is desirable since it allows channel encoders
and decoders to be designed independently of the actual source and user. The
source encoder and decoder in effect adapts the source and user to any channel
coding system which has sufficient capacity. As block length increases, the source
encoder outputs become equally likely (asymptotic equipartition property) so that,
in the limit of large block lengths, all source encoder outputs depend only on the
rate of the encoder and not on the detailed nature of the source.
From Fig. 7.6, we see a natural duality between source and channel block
coding. The source encoder performs an operation similar to the channel decoder,
while the channel encoder is similar to the source decoder. Generally, in channel
coding, the channel decoder is the more complex device, while in source coding
the source encoder is the more complex device. We shall see in Sec. 7.4 that this
duality also holds for trellis coding systems. Finally. we note that, although the
source encoder removes redundancy from source sequences and channel encoding
adds redundancy, these operations are done for quite different reasons. The source
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 407
encoder takes advantage of the statistical regularity of long sequences of the
source output in order to represent the source outputs with a limited rate R(D).
The channel encoder adds redundancy so as to achieve immunity to channel
errors.
We next show an interesting channel coding interpretation for the source
coding theorems of Sec. 7.2. For the general discrete memoryless source, represen-
tation alphabet, and distortion measure defined earlier, consider Pp, = {P(v|u):
D(P) < D} for some fidelity D. For any P € Ap, define the channel transition
probability for a discrete memoryless channel with input alphabet ¥ and output
alphabet W as
P(v|ujQlu)
~ Y Plelw oti
This is sometimes referred to as the “ backward test channel.” Now consider any
source code ZB = {v,, Vv, ..., Vy} of rate R and block length N. We can regard {v,,
V,,---, Vy} as a channel code’? for the above backward test channel as shown in
Fig. 7.7. Assume that the codewords are equally likely so that the maximum
probability of correct detection, denoted P.(V¥o, ¥;, ..., Vy), Would be achieved by
the usual maximum likelihood decoder. But suppose we use a suboptimum chan-
nel decoder which uses the decision rule, for given channel output u € WV,
Q(u|v) for allue WvEeVv (7.3.11)
choose VeE{¥o,Vy,..., Vy} | which minimizes dy,(u,v) (7.3.12)
Then for this suboptimum decoder, the probability of a correct decision, denoted
P.(V¥o; V1, ---> Yu), is upper-bounded by
P.(¥o, 43 ---5 Yu) = wo. ¥ Pry )< min dy(u, Vy") |¥mn iS sent|
m’'#=m
SF il6i Vs 5 Mie)
Bt Cte isis eed (7.3.13)
where the last inequality follows from the strong converse to the coding theorem
(Theorem 3.9.1). We now use (7.3.13) to show why in the source coding theorem
'! The vector vg plays the same role as the dummy vector v in the proof of Lemma 7.2.1.
Vo dy (U, Vo )
fv, dy (U,V, )
Yo DMC u
V> oe O (u | v) Pa men dy(u, V>)
@ <x Uy \
: Yu dy(U, Vy) Figure 7.7 Backward test channel.
408 SOURCE CODING FOR DIGITAL COMMUNICATION
(Theorem 7.2.1, also see Lemma 7.2.1) the source coding exponent, E(R, D), is
essentially the exponent in the strong converse to the coding theorem.
We are primarily interested in the term Pr{dy(u, Vo) < min,,¢9 dy(U, V,)|Vo is
sent} which may or may not be larger than P.(vo, v;, ..., Vy). However, if we
average (7.3.13) over the ensemble of codewords {v,, ¥;, ..., Vy} where all com-
ponents are chosen independently according to {P(v): v ¢ W’}, we have’?
Pridy(u, Yo) < min d,(u, V,,) |Vo is sent
m#0
= P (vo, Vis eeey "m)
<P {v5 V1, i hao)
Se AR 8 ee © (7.3.14)
which is exactly Lemma 7.2.1. Then, as in Sec. 7.2, for Pe Ap we have average
distortion
d(Z)<D+dp5 Pr lay U, Vo) < min dy(U, V,,) |Vo 1S Ey
| m#0O |
< D + dye MEd. P)- eR (7.3.15)
where
max [E,(p,P)—pR]>0 for R>I(P)
—if<sp=¢0
Here we see that the source coding theorem can be derived directly from the
strong converse to the coding theorem due to Arimoto [1973] by applying it to the
backward test channel corresponding to any P € #, as shown in Fig. 7.7. Since
the strong converse to the coding theorem results in an exponent that is dual to
the ensemble average error exponent, the source coding exponent is dual to the
ensemble average error exponent. |
Perhaps the least direct relationship between channel and source coding
theories is the relationship between the low-rate expurgated error bounds of
channel coding theory and the natural rate distortion function associated with the
Bhattacharyya distance measure of the channel. In particular, suppose we have a
DMC with input alphabet %, output alphabet ¥Y, and transition conditional
probabilities {p(y|x): y ¢ Y, x € X}. For any two channel input letters x, x’ € %,
we have the Bhattacharyya distance defined [see (2.3.15)] as
d(x, x’) = —In } \/p(y|x)p(y |’) (7.3.16)
and we suppose that the channel input letters have a probability distribution
{q(x): x € 2}. Alternatively, for a source with alphabet WY = 2, probability distri-
bution {q(x): x € 2}, representation alphabet Y = 2%, and the Bhattacharyya dis-
tance in (7.3.16) as a distortion measure, we have a rate distortion function which
12 We again use the overbar to denote the code ensemble average. Symmetry gives the equality here.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 409
we denote as R(D; q). This leads us to define the natural rate distortion function for
the Bhattacharyya distance (7.3.16) as
R(D) = max R(D; q) (7.3.17)
To show the relationship between R(D) and the expurgated exponent for the
DMC, let us consider the BSC with crossover probability p. Here ¥ = W = {0, 1}
and the distortion measure is
fag os 0 2x!
\—In /4p(1—p) x#x'
Thus letting « = —In ./4p(1 — p), we see that the Bhattacharyya distance is pro-
portional to the Hamming distance. It is easy to show (Sec. 7.6) that
(7.3.18)
R(D) = max (a) — (>)
q
=In2—- (2) (7.3.19)
and the corresponding source is the Binary Symmetric Source (BSS).
Recall from Sec. 3.4 [see (3.4.8)] that, by the expurgated bound for the BSC,
there exists a block code @ of block length N and rate R such that
Py < e7 NER) (7.3.20)
where D = E,,(R) satisfies R = R(D), and R(D) is given by (7.3.19). Hence, we see
that the natural rate distortion function for the BSC yields the expurgated expo-
nent as the distortion level.
We can also prove the Gilbert bound discussed in Sec. 3.9 by using the above
relationship with rate distortion theory. Let
d(N, R) = max d,,;,(@) (7.3.21)
€
where
dinin(@) = min dy(x, x’) (7.3.22)
x,x’€@
x Fx’
and where the maximization is over all codes of block length N and rate R. Next
let @* be a code of block length N and rate R that achieves the maximum
minimum distance with the fewest codeword pairs that have the minimum dis-
tance d(N, R). Hence
d(N, R) Pir Amin(@*)
>d(x|@*) = forallxe %y (7.3.23)
where
d(x|@*) = min dy(x, x’)
; *
x’ E€€
410 SOURCE CODING FOR DIGITAL COMMUNICATION
This inequality follows from the fact that if there exists an x* € 2, such that
d(x*|@*) > dmin(@*), then by interchanging x* with a codeword in @* that
achieves the minimum distance when paired with another codeword, there would
result a new code with fewer pairs of codewords that achieve the minimum dis-
tance. This contradicts the definition of @*. With (7.3.23), we can now prove the
Gilbert bound.
Theorem 7.3.3: Gilbert bound
d(N, R)>D
where
R= Rip tno = (>) (7.3.24)
and D,, = D/« is the Hamming distance.
Proor @* defined above is a code of rate R which has average distortion
d(@*) satisfying
d(6*) = ¥) an(x)d(x | 6*)
< dmin( 6 *)
= d(N, R) (7.3.25)
where (7.3.23) is used in this inequality. Here we consider @* as a source block
code. The converse source coding theorem (Theorem 7.2.3) states that any
source code @* with distortion d(@*) must have
R > R(d(¢*)) (7.3.26)
Since D is given by (7.3.24), we must have R(D) = R > R(d(@*)). Then since
R(D) is a strictly decreasing function of D on 0 < D < «/2, we have
d(N, R) > d(@*)
D
IV
The results for the BSC generalize to all DMCs, when we use the Bhatta-
charyya distance, if for the parameter s such that D = D,, the matrix [e“" *”] is
positive definite (see Jelinek [1968b] and Lesh [1976]). This positive definite condi-
tion holds for all s <0 in most channels of interest. This shows that, for an
arbitrary DMC, the Bhattacharyya distance is the natural generalization to the
Hamming distance for binary codes used over the BSC, and a generalized Gilbert
bound analogous to Theorem 7.3.3 can be found (see Probs. 7.8 and 7.9).
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 411
7.4 DISCRETE MEMORYLESS SOURCES—TRELLIS CODES
For block codes we have demonstrated a duality between source and channel
coding where the source encoder performs in a manner similar to the channel
decoder and the source decoder performs like a channel encoder (see Fig. 7.6).
This duality holds also for trellis codes. We now proceed to show that trellis codes
can be used for source coding where the source encoder performs the operations
that are essentially the operations of the maximum likelihood trellis decoding
algorithm of channel coding, while the trellis source decoder is essentially a trellis
channel encoder. In particular, we show that it is possible to use trellis codes,
which are general forms of convolutional codes, to achieve the rate distortion limit
(D, R(D)) of a discrete memoryless source.
Furthermore, the same algorithm which attains the channel coding bound
with convolutional channel codes (Viterbi [1967a]) also attains the source coding
bound with trellis (generalized convolutional) source codes. In this context,
however, the term “maximum likelihood” does not apply.
We again consider a discrete memoryless source with alphabet W = {ay,
a,,..., a4} and nonzero letter probabilities Q(a,), Q(a,), ..., Q(a,). The user
alphabet is denoted by VY = {b,, b>, ..., bg}, and there is a nonnegative bounded
distortion measure d(u, v) which satisfies
0 <d(u, v) <do (7.4.1)
for allue Wve V and some dy < ©.
Trellis codes are generalized convolutional codes generated by the same shift
register encoder as convolutional codes, but with arbitrary delayless nonlinear
operations replacing the linear combinatorial logic of the latter. Whether fixed or
time-varying, they can most conveniently be described and analyzed by means of
the familiar trellis diagram of Chap. 4. Figure 7.8 shows a trellis source decoder
and Fig. 7.9 shows the corresponding trellis diagram for the binary-trellis code
with K — 1 delay elements and a delayless transformation. Following the same
convention as for channel convolutional codes, we will refer to K as the constraint
length of the trellis code. We assume for the present a binary-trellis code with n
destination symbols per branch, resulting in a code rate r = 1/n bits per source
xei0, 1}
Xx;
x; Xj Xj-2 Ni-(K -1 )
vy aes y
_ . one 8 p_.
Vi = ns p> ni+2? ni+n)
Delayless transformation >
Figure 7.8 Trellis source decoder.
412 sOURCE CODING FOR DIGITAL COMMUNICATION
State 0 1 gh A } ag Chee gi te Rte? LX E=? LK =)
cial eh Dai atbcal se oitio pel emmale deel
(2% -! states)
Figure 7.9 Trellis diagram.
symbol. This means that, for each binary input, the trellis source decoder emits n
symbols from ¥, and a sequence of binary input symbols defines a path in the
trellis diagram. We can easily generalize to nonbinary trellis source decoders later.
Here, each branch of the diagram is labelled with the corresponding n-
dimensional destination vector in ¥,,, and the states (contents of the source
decoder’s first K — 1 delay registers) are denoted by the vertical position in the
diagram, also shown at the left of the trellis diagram. The trellis is assumed to be
initiated and terminated in the 0 state, and no encoding or decoding is performed
during the final merging in what we will call the “tail” of the trellis. There are
2*~! states, and we assume that the trellis source coding operates continuously for
many source symbols so that the effects of the tail can be ignored. We let the total
code length be L branches, while the tail requires K — 1 further branches.
The source encoder searches for that path in the trellis whose destination
(user) sequence v most closely resembles (in the sense of minimum distortion) the
source sequence u. Once the source encoder picks a path, then it sends binary
symbols x through the channel (again taken to be noiseless) which drives the
trellis source decoder through the desired sequence of states yielding the desired
path v as the trellis source decoder output. Figure 7.10 shows the block diagram
for the trellis source coding system.
We assume that the trellis source coding system operates continuously for a
long time between initial fan-out and final merging. This means we assume that
L > K and that the effects of the tail can be ignored. In particular, we will ignore
the last K — 1 branches where all paths merge to the zero state. Hence, the total
code length is taken to be L branches and we have a total source sequence length
of N,; = nL. There are many possible paths or trellis codewords of length N,, one
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 413
of which must be chosen to represent the source sequence. For a given source
sequence u and any trellis codeword v, we have the distortion measure
dy, (u, v) = — Y d(u,, v,) (7.4.2)
The source encoder chooses the path corresponding to the trellis source de-
coder output v that minimizes dy (u,v). Defining, for each index ie {0, 1,
2,..., L — 1}, the subsequences of length n
U; ms (Uni+ 1> Uni+2> SA °% Uni +n)
7.4.3
uf Bina (Uni + 1> Unit 29 -+*> Uni+n) ( )
and branch distortion measures
1 n
d,,(u;, v;) “= n > A(Uni+1> Uni+t)> (7.4.4)
t=1
we can rewrite dy, (u, v) as
1 L-1
dy, (u, v) = 5 >> d,(u;, ¥;) (7.4.5)
i=0
In this form it is clear that the source encoder selects a path in the trellis which
consists of a sequence of L connected branches, where each branch adds an
amount of distortion that is independent of the distortion values of other branches
in the path. For a given source sequence u, the source encoder’s search for the path
Viterbi algorithm
Source > Search for the
minimum distortion x,€ {0, 1}
path in the trellis
Source encoder
Noiseless
channel
Trellis code
User = -
Follow the path
indicated by x
Source decoder
Figure 7.10 Trellis source coding system.
414 SOURCE CODING FOR DIGITAL COMMUNICATION
in the trellis that minimizes dy, (u, v) is equivalent to the channel decoding search
problem where the Viterbi algorithm was used to find the path, or convolutional
codeword, that minimizes the negative of the log-likelihood function. Hence the
source encoder for trellis codes can be realized with the Viterbi algorithm.
For the given source and distortion measure, we have shown in Sec. 7.2 that
the rate distortion function R(D) is given by (7.2.53). Regardless of what type of
source coding system we consider, the converse source coding theorem (Theorem
7.2.3) has shown that it is impossible to achieve average distortion of D or less
with a system using rate less than R(D). This converse theorem applies to trellis
source coding as well as to block source coding (Sec. 7.2). We have also shown
that, in the limit of large block lengths, block source coding systems can achieve
average distortion D with rate R(D) nats per source symbol, thus justifying R(D)
as the rate distortion function. In this section, we will show that, in the limit of
large constraint length K, trellis codes can also achieve the rate distortion limit.
We again appeal to an ensemble coding argument where we consider an
ensemble of binary trellis source codes of constraint length K and bit rate r = 1/n.
The ensemble and the corresponding distribution are so chosen that each branch
of the trellis diagram has associated with it a user or representation sequence
consisting of symbols with common probability distribution {P(v): v « W} with
independence among all symbols. Now for any given source sequence u and any
given trellis code, we denote the minimum distortion path sequence as v(u). Thus
by definition, we have the bound dy (u, v(u)) < dy,(u, v) for any other path se-
quence v belonging to the trellis code. We now choose v = y* as follows:
1. For a given trellis code and the given source sequence u, replace the representa-
tion sequence of the all-zeros state path by the sequence v, randomly selected
according to the conditional probability
P(vo|u) = Tf Pleo) or it)
This results in a new trellis code which differs from the original trellis code only
in the branch values of the all-zeros state path. We call this modified trellis code
a forbidden code, since in general we are not allowed to select parts of a trellis
code after observing the source output sequence u. Note that the original code
and the corresponding forbidden code differ only in the forbidden code path
corresponding to the all-zeros state path.
2. Given a source sequence u, for the above forbidden code, let v** be the mini-
mum distortion path sequence. That is, let v** correspond to the forbidden
trellis code output sequence which represents u with minimum distortion.
3. v** defines a path through the forbidden trellis diagram. Now choose v* as the
corresponding path sequence in the original trellis diagram. Hence v** and v*
are the same except for the subsequences on branches of the all-zeros state
path.
Note that v* is a trellis code sequence in the originally selected trellis code, and we
introduced the forbidden trellis code only as a means of selecting this trellis code
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 415
sequence. We never use the forbidden trellis code in the actual encoding of source
sequences and require it only to derive the following bounds. Since v* is a path
sequence in the trellis code, we have from the definition of v(u)
dy (u, v(u)) < dy, (u, v*) (7.4.7)
We now derive a bound on dy,(u, v(u)), where (-) denotes an average over all
source sequences and trellis codes in the ensemble. We do this by bounding
dy ,(u, v*).
Lemma 7.4.1
do | cade Be Fie oe |
dy (u, v(u)) < D(P) + i YiokP, (7.4.8)
j=0 k=1
where
v* merges with the all-zeros state
P ,, = Pr. path at node j and remains merged ; (7.4.9)
for exactly k branches
and
P) = YY Qlw)P(v|u)d(u, v) (7.4.10)
PROOF For a given source sequence u and for v* as selected above, let # =
{i: v¥ is a branch output vector of the all-zeros state path}. Then
Ldy (u, v*) = id, u;, V*)
= )' d,(u;, v¥)+ > d,(u,, v¥) (7.4.11)
i¢ Zz ie Zz
For i¢ & we use the bound d,(u;, v*)<d), while for i¢ # we have
d,(u;, v*) = d,(u, vF*).
Hence
Ldy,(u, v*) < > d,(u;, v¥*) + ¥ do
i¢z ie Zz
1 eee!
tg Se rad Bal oe
i=0 iez
tigress |
< ¥ d,(u;,¥o;)+ > do (7.4.12)
i=0 iez
where Vo; is the ith branch output vector of the all-zeros path of the forbidden
code. This last inequality follows from the fact that, by the definition of v**,
we have
dy (u, v**) < dy, (U, Vo) (7.4.13)
416 SOURCE CODING FOR DIGITAL COMMUNICATION
in the forbidden trellis code where vp is the all-zeros state output sequence.
From (7.4.7) and (7.4.12), we obtain the inequality
dy,(u, v(u)) <dy,(U, Vo) + ) = (7.4.14)
When we average (7.4.14) over all source sequences and over the trellis code
ensemble, the first term becomes D(P). Using the definition of P;, given in
(7.4.9), we employ the union-of-events bound on the second term to get the
desired result.
There remains only the evaluation of a tight bound for P;,. This is computed
over the ensemble of forbidden trellis codes which consist of the normal codes
with the branch vectors of the all-zeros state path v, selected according to (7.4.6)
for each source sequence u. Note that when v* merges with the all-zeros state path
at node j and remains merged for exactly k branches, in the corresponding forbid-
den code v** also is merged with the all-zeros state for the same span. Hence Pj, is
also the probability that, in the forbidden trellis codes, v** (the minimum distor-
tion path) merges with the all-zeros state path at node j and remains merged for
exactly k branches.
Let x** be the binary input sequence to the forbidden trellis decoder that
yields the minimum distortion codeword v**. If v** merges with the all-zeros state
for exactly k branches starting with the jth node, the binary sequence x** has the
form
“"" Qy A,°*°* Ax-4 1 0 0 “++ Q 0 0:::-0 1 b, By s*i bgiy Lee (7.4.15)
t t t t
nodej—K nodej nodej+k nodej+k+kK
At node j — K, we take the forbidden trellis decoder to be in state a = (a;, a2,...,
ax-,), and at node j+k+K to be in state b = (by, bz, ..., bg_,). The “1”
immediately following node j — K is required, for otherwise merging could not
start exactly at node j. Similarly a “1” must follow node j + k, for otherwise the
merged span would be longer than exactly k as assumed. The merged span is
shown in Fig. 7.11. Note that states a and b are unrestricted, and either or both
may possibly be the all-zeros state.
Now for the moment let us assume that states a and b are fixed. That is, the
trellis path corresponding to the minimum-distortion forbidden trellis decoder
output, v**, is assumed to have passed into state a at node j — K and state b at
node j + k + K. Then we seek the probability that the subpath with decoder input
sequence
a100--001b (7.4.16)
is the minimum distortion path (subsequence of x**) from state a to state b in the
forbidden trellis code. Any other path from a to b has an input of the general form
aXj;-xKXXj4,b (7.4.17)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 417
j-K ante j j+1 j+2 oS j+k oe oat ee ae 4
6) OQ
3
7 mE
Ff Nee
S77 NU:
6
nae \t
Figure 7.11 Merger with the all-zeros state path.
where X = (X;~ x44, ---» Xj+x-1)- Since the probability that path a 1 0 1 b is the
minimum distortion path among all paths of the general form a x;_x x x;+, b is
upper-bounded by the probability that path a 1 0 1 bis the minimum distortion
path among all paths of the restricted form a 1 x 1 b, we now consider only paths
of this restricted form. Let v(x) be the forbidden trellis decoder output for the
(k + 2K) branches going from state a to state b corresponding to the input
a 1 x 1 b. Then for random source subsequences of length n(k + 2K), denoted
u, and for the ensemble of forbidden trellis codes, we seek to bound P,, by first
bounding the probability’?
P (a, b) = Pr atu v(0)) < min d(u, v(x))} a, |
xFO
By restricting our attention to subpaths from state a to state b of a forbidden
code, we have formulated the problem as a block source coding problem. Our
bound on P, will be developed in a way analogous to the block coding bound of
sec. 7.2.
Lemma 7.4.2 Over the trellis code ensemble just defined,
P » J he? = ME Ap. PYR— P) (7.4.18)
where
1+ p
E,(p, P) = —In ¥ (> P(o}Q(u|v)"""*”} —-l<p<0O (7.4.19)
and
R=r In 2 = (In 2)/n
*> To simplify the notation, in the following, when the subscript on the distortion d(- , -) is missing,
we assume that, as always, it is defined by the dimensions of the vectors involved.
418 SOURCE CODING FOR DIGITAL COMMUNICATION
PROOF We now require some notation to separate branch vectors of the all-
zeros path from other branch vectors of the forbidden trellis. As was discussed
above, we are concerned only with branch vectors associated with paths in the
trellis of the form a 1x 1b. Hence our notation refers only to quantities
associated with these paths.
uc denotes the source subsequence over the central k branches
u;, denotes the source subsequence over the first K and final K branches of
the subtrellis under consideration
v‘(0) denotes the branch vectors of the all-zeros state path over the central k
branches
Vj, denotes the collection of all branch vectors over the central k branches
not belonging to the all-zeros state path
If u is the subsequence of the source in going from state a to state b, then we
have Q(u) = Q(u,,)Q(u‘), since all components of u are independent and iden-
tically distributed. The term v‘(0) also represents the only part of the all-zeros
path of the On trellis that is relevant; it is a random sequence selected
according to P(v‘(0)|u‘) and is independent of Y ;, - Vectors v‘(0) and V ;,
comprise all ina fea vectors corresponding to paths in the forbidden trellis
code with binary subsequences of the form a 1 x 1 b. Hence all the quantities
of interest have the joint probability distribution ‘+
z (Vins v‘(0), Ujx > u’) = P (VY 4,)Q(uj,)Q(u°)P (v°(0) |u‘)
Now we define the indicator function
1 = d(u, v(0)) < min d(u, v(x))
O(u, v°(0); Vj.) = 27? (7.4.20)
0 otherwise
Then
P (a, b) = Pr \d(u, v(0)) < min d(u, v(x))
x0
‘ a Ld dX PY n)O(uj,)Q(u‘)P(v(0)|u°)O(u, v0); Wx)
V jk Wjp ue ve(0)
a, b
=LV UL DY PY «)Q(ujx)P(v(0))O(u‘|v(0))O(u, v0); x)
V jk Uj, uc ve(0)
< ECE Pala) | Y Peioyoru|royr |”
V jk Wj, we ve(0)
3 P(v‘(0))®(u, v°(0); rn) a (7.4.21)
v-(0)
‘* Note that uj, and Y ,, are independent since they refer to disjoint segments.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 419
where the Holder inequality is used with —1<p<0. Next the Jensen
inequality yields the further bound
P,(a,b)< ¥ Olas) FE Pevio)our|(oyrer” |
U jx uc v-(0)
ae
[EE Pr a PEr(O)yO(a, v0}: )] (74.2)
V jx vO)
In the last bracketed term, we note that there is now complete symmetry for
all paths involved since the section of the all-zeros state path v‘(0) has the
probability P(v‘°(0)) induced by definition (7.4.6), which is the same for all
branch vectors in Y ,, and thus, since all of the 2***~' paths'® of the form
a 1 x 1 b have the same statistical properties, we have
EDP PHO) V0); Hn) = see (74.23)
v jk vo(O)
independent of u. Hence
P,(a,b)< >
uc
» P(v‘(0))Q(u‘ | v‘(0))/a+) | P+ K- 1)p
ve(0)
kn
= 2k+K-1)p » (> Pecjatujay'r) ]
— 2(K- 1)p7—KLE(p, P)/R —p] (7.4.24)
Here we use the fact that all components of all vectors are statistically
independent of each other. Since the bound on P,,(a, b) is independent of
states a and b, we obtain the desired result.
Using this bound on P;,, we now obtain from (7.4.8)
L-1 L-j-1
dy,(u, v|u) < D(P) +o YY ee ee
j=0 k=1
< D(P) + do Se k2‘K- 1)p7 — MEo(. P)/R — p)
k=0
oe D(P) + r oS seal 1)p 5 k2~HE dl P)/R—-p]
k=0
do Q(K- De
[1 — 2~ Wate, PyR- or)? (7.4.25)
= D(P) +
provided E,(p, P)/R — p > 0. Recall that E,(p, P) is the Gallager function whose
properties are given in Lemma 7.2.2. From (7.2.58), we have
p .B
E,(p, P) = pI(P) + | | Es(a, P) da dp
'S This corresponds to all paths over the k central branches of the subtrellis starting in any one of
the 2*~! possible states.
420 SOURCE CODING FOR DIGITAL COMMUNICATION
where
For —4 <p <0, we have the bound (see Prob.7.3)'®
E"(a, P) > —C, = —2— 16[In AP? (7.4.26)
which yields
2
E,(p, P) > pI(P) — C= (7.4.27)
This inequality is then used to bound
E,(p, P) I(P) C,(p?
Ro See ope fa RS
I(P)—R_ C,(p?
=p ( : bat (5) (7.4.28)
Next we choose
I(P)—R
p= feed <0 (7.4.29)
C,
so that the lower bound in (7.4.28) becomes
I(P)—R] _C,(p?\ _ (IP) — R)
Rae cn] fall | 7.4.30
o| R | ae 2RC, ies
Defining
py tos eRe Te)
E(R; P) = —p C.
we have
dg. 1)E(R;P)
dy ,(u, v(u)) => D(P) ‘c [1 Pe 2 ere (7.4.31)
where E(R; P) > 0 for R > I(P).
Recall from (7.2.53) that the rate distortion function is
R(D) = min I(P)
Pe Pp
‘© Actually any bounded number larger than 2 + 16[In A]? will suffice. By choosing C, large enough
we can always choose p in (7.4.29) such that p > —4.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 421
where
Py = \P(v|u): d d Q(u)P(v|u)d(u, v) < D
For P* € #, that achieves R(D) = I(P*), we define
_ EAR, D) = E(R; P*) =
and we have the source coding theorem.
Theorem 7.4.1: Trellis source coding theorem Given distortion D, for any
constraint length K and rate R = (In 2)/n > R(D) for any integer n, there
exists a binary trellis code T, with average distortion d(T,) satisfying
dy 27 K~ WEAR, D)
d(T) $ D+ >= cramer. p (7.4.32)
where
— R(D
E(R, D) = ao +0 (7.4.33)
Proor The only additional observation we make from (7.4.31) is that at least
one code has average distortion less than or equal to the ensemble average
distortion.
This theorem shows that in the limit of large constraint length K we can
achieve the rate distortion limit (D, R(D)) with the trellis source coding system
shown in Fig. 7.10. Furthermore, it gives a bound on the distortion achievable
with finite constraint length.
Up to this point we have considered trellis decoders with only binary inputs,
which corresponds to a trellis diagram where only two branches leave each node.
We can easily generalize to the case where the decoder has one of g inputs so that
the corresponding trellis diagram has g branches leaving each node. Over the
noiseless channel, the encoder sends qg-ary symbols for each n source symbols so
that, for these codes, the rate is
l
yom TE bits/source symbol
or
In
R= ze nats/source symbol (7.4.34)
There are still n representation symbols for each branch. The proof is essentially
the same as for the binary case where g = 2, but it requires conditioning on states
422 SOURCE CODING FOR DIGITAL COMMUNICATION
a, b and on the two nonzero symbols that follow a and precede b. For arbitrary
integer q, (7.4.8), (7.4.9), and (7.4.10) are the same, but now P,, is bounded by
P x, a "aa 1)pg- MEolp. PR — pI (7.4.35)
Hence, for this more general case, we have the same source coding theorem with
(7.4.32) replaced by
—(K—1)E,(R, D
do qv K~ EAR. D)
d(Tx) = D+ [1 oe gels aay (7.4.36)
where
R— R(D
E(R,-D) = —
To examine the rate of convergence to the rate distortion limit (D, R(D)) as
constraint length K increases, we merely substitute E,(R, D) into the bound
(7.4.36) and rewrite this as
do q7 K~ WR-RONICo
a [1 — go R R@vP/2RCo)2
d(T) < D
dy q~ ROC
a a [1 bi gq” (R- RD)P/2RCo]2 ee (7.4.37)
where N, = nK = (K/R) In q is the equivalent block length. Comparing this with
the convergence of block source coding given by Theorem 7.2.4, we see that this
bound on distortion has an exponent proportional to R[R — R(D)], whereas with
block codes the exponent is proportional to [R — R(D)]?/2. We observed similar
superiority for convolutional codes over block codes in channel coding.
In this section we described and analyzed trellis source decoders and the
optimum trellis source encoder implemented by the Viterbi algorithm. As with
channel coding, the computational complexity of the optimum source encoder
grows exponentially with constraint length. It is natural to consider suboptimum
path search algorithms such as the sequential decoding algorithms for channel
convolutional codes. These algorithms can reduce the computation required per
source symbol.
For the most part, sequential algorithms for source encoding are best
analyzed in terms of tree codes which are trellis codes with infinite constraint
length, K = oo. This results in a trellis diagram that never has merging nodes but
continues to branch out forever with independent random representation vectors
on all branches. For a tree source decoder, the optimum source encoder finds the
path in the tree that can represent a source sequence with minimum distortion.
Jelinek [1969], by using branching theory arguments (see Probs. 7.10 and 7.11),
was the first to show that tree codes can achieve the rate distortion limit (D, R(D)).
We obtain the same result by letting K = oo in (7.4.37).
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 423
Note finally that the source trellis or tree encoder need not necessarily find the
unique path that represents the given source sequence with minimum distortion.
There may, in fact, be many paths that can represent a source sequence within a
desired fidelity criterion D and so it is natural to consider various sequential
search algorithms which choose the first subpath that meets a fidelity criterion.
Anderson and Jelinek [1973] and Gallager [1974] have proposed and analyzed
such algorithms and have shown convergence to the rate distortion limit for
various sources. Sequential algorithms of this type, although suboptimum, yield
less complex trellis or tree source encoders and still achieve the rate distortion
limit.
7.55 CONTINUOUS AMPLITUDE MEMORYLESS SOURCES
Many sources, such as sampled speech, can be modeled as discrete-time sources
with source outputs which are real numbers; that is, a source with alphabet
R = (—., oc). We now consider such discrete-time continuous-amplitude mem-
oryless sources where the source and representation alphabets are the real num-
bers, the source output at time n is a random variable u, with probability density
function {Q(u): —0o <u < oo},!’ and we have a possibly unbounded nonnegative
distortion measure d(u, v) for each u, v € &. All outputs of the source are indepen-
dent and identically distributed. ve distortion PAL sequences u and v of
length N is again defined as dy(u, v) = (1/N) }%-, ). Besides the fact that
the source outputs are now continuous real ale nate the main difference
from the previous sections is the fact that the single-letter distortion measure can
be unbounded. This is to allow many common distortions such as the magnitude
error, d(u, , and the squared error, d(u, v) = (u — v)*, distortion
measures. To overcome the fact that we no longer have a bounded distortion
measure, we require instead the condition
ie)
| Q(u)d?(u, 0) du < d2 (7.5.1)
for some finite number d,. This is the condition that the random variable d(u, 0)
has bounded mean'® and bounded variance which is satisfied in most cases
of interest. Throughout the following we will assume this condition for continuous
amplitude sources and distortion measures.
17 We shall denote all probability density functions associated with source coding with capital
letters.
‘8 HOlder’s inequality (App. 3A) applied to (7.5.1), the bounded variance condition, implies a
bounded mean, that is
| * Q(u)d(u, 0) du < dy
bag:
424 SOURCE CODING FOR DIGITAL COMMUNICATION
7.5.1 Block Coding Theorems for Continuous-Amplitude Sources
Again referring to Fig. 7.3 for the basic block source coding system, let
B =V1, V2, ..., Vy} be a set of M representation sequences of N user symbols
each, which we call a block code of length N and rate R = (In M)/N nats per
source symbol. For this code the average distortion is now
ie.) 0
d(@) = | ve | Ox(u)d(u| #) du (7.5.2)
where
d(u|#) = min dy(u, v)
On(u) = [] Ou)
In proving coding theorems for block codes we essentially follow our earlier
proofs for the discrete memoryless source presented in Sec. 7.2, the main differ-
ence being that integrals of probability density functions replace summations of
probabilities. As before, we use an ensemble average coding argument by first
introducing the conditional probability density function
Palv|u) = T] Posl) (7.5.3)
and the corresponding marginal probability density on By
Puv)= J | Qn(u)Pa(v|u) du
=
=] Pe) (754)
where
oO
P(e) =| Q(u)P(v|u) du (7.5.5)
Proceeding exactly as in Sec. 7.2 [Eqs. (7.2.8) through (7.2.11)], but with summa-
tions replaced by integrals, we obtain
oO 00
d(Z) = | es | Ox(u)Py(v |u)d(u| B)[1 — O(u, v; B)] du dv
+f] OnPa(e luda] a}O(a, v5 4) du dv (756)
(ee)
where
(7.2.10)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 425
Now, defining (as in 7.2.15)
DIP)=| | Q(u)Plv|ud(u, v) du do (7.5.7)
we see that since 1 — ® < 1, the first integral is easily bounded to yield
d(Z) < D(P) + E : Qy(u)Py(v|u)d(u| Z)O(u, v; A) du dy (7.5.8)
fo.)
To bound the second term we can no longer appeal to the argument that
d(u|#) is bounded by dy. Instead we use a simple form of the Holder inequality
(App. 3A)
= 2
ie 6)
ie Qy(u)Pyx(v |u)d(u| Z)O(u, v; B) du dv
. 0 a0 1/2
< [J | entratr|uyatu|)) au dy]
a) fe @) 1/2
“ | ves | Qx(u)Px(v|u)®(u, v; A) du i (7.5.9)
where we noted that ®? = @. Next we assume that v, €¢ B = {v,, V2, ..., Vy} is the
all-zeros vector; that is, v; = 0. Then d(u|#) < dy(u, v,) = dy(u, 0) and
oO fee) ie.)
Ff QstupPa(v|u(ata| a)? du av =| f Qylu)(d(u|.9))? du
—-@ _ —_
~2 ~®
< |] Qp(u)(ds(u, 0))? du
- f we fe Qx(u) *s D dle 0 du
2 ~ @ ape :
e [ gh [ Qx(u) = Za (u,,; 0| du
< dj
(7.5.10)
where the last inequality follows from (7.5.1). Hence when v, = 0 € Z, we have
d(B) < D(P) + do if a fe Qy(u)Py(v|u)®(u, v; A) du iv] (7.5.11)
—-@ pa
We now proceed to bound the ensemble average of d(#). We consider an
ensemble of codes in which vy, = 0 is fixed and the ensemble weighting for the
remaining M — 1 codewords is according to a product measure corresponding to
independent identically distributed components having common probability den-
426 SOURCE CODING FOR DIGITAL COMMUNICATION
sity {P(v): -co <v< oo}. Now for any code #={v,,V2,..., Vy}, define
B = v3, V3, .--, Vy} which is the code without codeword v, = 0. Then clearly
d(u| ZB) < d(u| ZF) (7.5.12)
and
O(u, v; Z) < Du, v; Z) (7.5.13)
Hence for any code &, we have from (7.5.11) and (7.5.13)
d(B) < D(P) + do if ing i Oy(u)Py(v|u)®(u, v; 4) du dv 2 (7.5.14)
— oo _
Now averaging this over the code ensemble and using the Jensen inequality yields
d(B) < D(P) + do ic a f (ye, (vl wyOlw: 7: B) de ore
< D(P) + do re wo hs O(ulPo(rinyin,.¥; B) au as 1/2 eae
— = @ 55
The term inside the bracket can now be bounded by following the proof of Lemma
7.2.1 [(7.2.20) through (7.2.22)], replacing summations of probabilities with inte-
grals of probability densities. This yields the bound
fe | Ox(u)Px(v[uj®(u, v5 B) du dv <e"MFRMP) (7.5.16)
—-Of) nf
where
E(R; p, P) = —pR + E,(p, P)
—-l1<p<0O
fo @) itp
E,(e, P) = —In i I P(v)Q(u|v)'/*” dv du (7.5.17)
=e 7
The properties of E,(p, P) are the same as those given in Lemma 7.2.2 where now
I(P) is
aoa a P(v|u)
(P)=| | Q@)P(v|) In Pi)
the average mutual information. Then it follows from Lemma 7.2.3 that
du dv (7.5.18)
max E(R; p, P)>0 for R > I(P) (7.5.19)
—1spsod
Combining these extensions of earlier results into (7.5.15) yields
d(B) < D(P) + do e71/2NERi 0. P) (7.5.20)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 427
where
max E(R;p,P)>0 forR>I(P)
SLT epaU
At this point we are still free to choose the conditional probability density
{P(v|u): u, v € Z\ to minimize the bound on d(¥). Suppose the fidelity criterion
to be satisfied by the block source coding system is specified as D(P) < D. Then let
Py = {P(v|u): D(P) < D} (7.5.21)
and define
E(R, D)= sup max E(R; p, P) (7.5.22)
Pe Pp —1<p<0
and the rate distortion function
R(D) = inf I(P) (7.5.23)
Pe Pp
Applying these to (7.5.20) yields the coding theorem for continuous-amplitude
memoryless sources.
Theorem 7.5.1: Source coding theorem For any block length N and rate R
there exists a block code with average distortion d(Z) satisfying
Hc D+d<-°""” (7.5.24)
where
E(R, D) > 0 for R > R(D)
PRrooF See the proof of Theorem 7.2.1.
We defined R(D) in (7.5.23) as the rate distortion function for a continuous-
amplitude memoryless source, where the unbounded single-letter distortion meas-
ure satisfies a bounded variance condition. To justify this definition we need to
prove a converse theorem. This is easily done using the same basic proof given
earlier for the discrete memoryless sources (see Theorem 7.2.3).
Theorem 7.5.2: Converse source coding theorem For any source encoder-
decoder pair, it is impossible to achieve average distortion less than or equal
to D whenever the rate R satisfies R < R(D).
The proof of the direct coding theorem given in this section certainly applies
as well for discrete memoryless sources with unbounded distortion measure as long
as the bounded variance condition is satisfied. Similarly for discrete sources with a
countably infinite alphabet we can establish coding theorems similar to Theorem
7.5.1. An example of such a source is one which emits a Poisson random variable
and which has a magnitude distortion measure (see Probs. 7.25, 7.26, and 7.27).
428 SOURCE CODING FOR DIGITAL COMMUNICATION
Here we have shown that coding theorems can be obtained for continuous
amplitude sources using proofs that are essentially the same as those for discrete
memoryless sources with bounded single-letter distortion measures. All of the
earlier discussion concerning the relationship between channel and source coding
also applies. In fact, the trellis coding theorem can also be extended in this way, as
will be shown next.
7.5.2 Trellis Coding Theorems for Continuous-Amplitude Sources
We extend the results of Sec. 7.4 to the case of continuous-amplitude memoryless
sources with unbounded distortion measures that satisfy the bounded variance
condition of (7.5.1). The basic trellis source coding system is again presented in
Fig. 7.10 with the only difference here being that the source and representation
alphabet is the real line, 2 = (— 00, 00), and the distortion measure is possibly
unbounded.
Following the same discussion which led to (7.4.7), we have
dy (u, v(u)) < dy, (u, v*) (7.5.25)
where y* is a trellis decoder output sequence that is selected by finding v**, the
minimum-distortion path sequence of the corresponding forbidden trellis code.
Then defining % = {i: v* is a branch output vector of the all-zeros state path} we
have again (7.4.11)
Ls
Ldy ,(u, v*) nf >. d,,(u;, v*)
i=0
= Yd,(u;, vt) + ¥ d,(u;, v¥) (7.5.26)
i¢z ier
For i ¢ &, recall that d,(u, v*) = d,(u, v**) and so
Ldy,(u, v*) = > d,(u,, vF*) + > d,(a;, v¥)
i¢ Z ie x
| gta |
< dds (u;, v¥*)+ > d,(u;, v*)
ie ®
L-1
= 20 n(U;, Voi) + > d,(u u;, vi
i=0 ie x
or
dy, (u, v*) < dy, (u, Vo) >) d,(u;, v* (7.5.27)
Licz
since for the forbidden trellis code dy (u, v**) < dy,(u, Vo), where Vo is the all-zeros
state output sequence of the forbidden trellis code.
Thus, for any trellis code, output sequence u, and the corresponding forbidden
trellis code, we have from (7.5.25) and ee the bound
dy (u, v(u)) < dy, (U, Vo >) d,(u;, V*) (7.5.28)
Lice
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 429
The only difference between this construction and that used in Sec. 7.4 is that we
now consider only trellis codes where all branch outputs of the zero state are zero.
This does not change v, of the forbidden trellis code but does imply that
d,(u;, ¥*) = d(u;, 0) for all ie 2%. Hence
1
dy (u, v(u)) < dy, (u, Vo) + : > d,(u;, 0) (7.5.29)
ie Zz
Let us define the indicator function
1 ie Ff
0(Z) = 0 i¢e (7.5.30)
Then
1 Bons |
dy ,(u, v(u)) = dy, (u, Yo) 4g L > d,(U; , 0)0(Z) (7.5.31)
i=0
Now averaging over all source sequences u and Vv, of the forbidden code we have
for the given trellis code
1 i Sane Geet
dy,(u, vu) <D(P) +> Y |
=O *-—@
‘2 Ov,(u)Pw,(V¥o|u)d,(u;, 0)®,(2) du dv,
(7.5.32)
where
DIP)=[ | Qu)Plv|updlu, v) de du
ae a
The second term can be further bounded using the Hdlder inequality and the
bounded variance condition of (7.5.1).
te
fe @)
ia Ov ,(u)Px,(Vo|u)d,(u;, 0)P(2) du dv,
<
i i Qy,(u)Py,(Vo|u)[d,(u;, 0)? p= iyo}
ro. @)
x if a J Qna(w)Pa, (v0 0}(2) du iv]
2 if Ve Ov, (u)(d,(u, 0))? ia)"
20
1/2
x le i Oy ,(u)Py, (Vo |u)®(Z) du Na)
<a [e ae fe On, (u)Pw, (Vo |u)®(Z) du at 73:33)
E
4
430 SOURCE CODING FOR DIGITAL COMMUNICATION
Thus combining (7.5.32) and (7.5.33) we have the bound
dy,(u, v(u)) < D(P) + ; 4 : ip ge ae _Ow.(u )Py,(Vo|u)®(Z) du vol
< D(P) + do f ss ie oS ie Oy,(u)Py,(Vo | u)®(2) du iyo}
(7.5.34)
We now consider an ensemble of trellis codes that have zero branch vectors
on the all-zeros state path, and on all nonzero state branch vectors have indepen-
dent identically distributed random variables with common probability density
{P(v), — 0 <v < oo}. Proceeding as in the proof of Lemmas 7.4.1 and 7.4.2, we
obtain
fava) <D(P)+ do] SY Pa) (7835)
where P ,, is defined in (7.4.9) and bounded by
Pi, ea VK— 1)p) —k[E,(p, P)/R- p] (7.5.36)
where in this case
foe) fo) L+p
E,(p, P) = —In | I! P(v)Q(u|v)'/Cr” i du —-1<p<0
Since P;, depends on the forbidden trellis code which is the same as in Sec. 7.4, the
bound (7.5.36) follows from the proof of Lemma 7.4.2 when sums of probabilities
are replaced by integrals of probability densities. Thus we have finally, as in
(7.4.25),
do Q(1/2)(K — 1)p
[1 Ve 27 ae ENR ab?
dy, (u, v(u)) < D(P) + (7.5.37)
and have established the trellis source coding theorem for continuous-amplitude
memoryless sources.
Theorem 7.5.3: Trellis source coding theorem Given any fidelity D, constraint
length K and rate R = (In q)/n > R(D) for some q and n, there exists a trellis
code T, with average distortion d(T,) satisfying
, do Bet 1)LR - R(D)1/2C.
d(T) <
(7.5.38)
—[R— R(D)]2/2RC,]2
"T= ‘|
Proor Having established the bound (7.5.37), the proof follows the proof of
Theorem 7.4.1.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 431
7.6 EVALUATION OF R(D)—DISCRETE MEMORYLESS
SOURCES*
For discrete memoryless sources the rate distortion function R(D) is given in
(7.2.53). This definition is analogous to that of channel capacity in channel coding
theory where the channel capacity is the maximum rate below which the random
coding error exponent E(R) is positive. In the source coding theorem, R(D) is the
minimum rate above which the exponent E(R, D) is positive. Analogous converse
theorems also exist. Thus both channel capacity and rate distortion functions are
defined in terms of extreme values of average mutual information over some
constrained space of probability distributions; hence, it is not surprising to find
that techniques for evaluating rate distortion functions are similar to those for
finding channel capacity. In fact, it is not surprising as a result that, while channel
capacity was shown to be the maximum average mutual information, R(D)
appears as a minimum of average mutual information subject to the distortion
measure constraint. In App. 3C, we presented a simple computational algorithm
for channel capacity. A similar algorithm can be used to find R(D) and this is given
in App. 7A.
We now examine ways of finding the rate distortion function for various
sources and distortion measures. First we examine some properties of R(D). Note
that in general (see Prob. 1.7)
<In A (7.6.1)
where A is the alphabet size of the discrete memoryless source and
I(P)>0
Hence we have the bound
0<R(D)< H(¥%)<In A (7.6.2)
Let us next examine the range of values of D for which R(D) exists. Recall from
Sec. 7.2 that
Cee !P(w|u): D(P) = > ¥ Q(u)P(v|u)d(u, v) < D|
is a nonempty closed convex set for D > D,,,,, where
Din = >, Q(u) min d(u, v) (7.6.3)
* May be omitted without loss of continuity.
432 SOURCE CODING FOR DIGITAL COMMUNICATION
is the minimum possible average distortion. For D < D,,;,, R(D) is not defined.
Since I(P) is a continuous, real-valued function of P, it must assume a minimum
value in a nonempty, closed set and therefore, R(D) exists for all D > Dy.
Let D.ax be the least value of D for which R(D) = 0. This is equivalent to
finding the conditional probability {P(v|u)} satisfying
D(P) = © Y Q(u)P(v|u)d(u, v) < Dinax (7.6.4)
and for which I(P) = 0. But 1(P) = 0 if and only if Y and Y are independent (see
Lemma 1.2.1 given in Chap. 1); that is
P(v|u)=P(v) forallue@vev (7.6.5)
Hence
D(P) = ¥ Plo) Y Olu)du, v) (7.6.6)
which we can minimize over {P(v)} to obtain
Drax = Min Y Q(u)d(u, v) (7.6.7)
where the minimizing {P(v)} is zero everywhere but at the value of v which mini-
mizes d(u, v) [see (7.6.10)]. From this we see that R(D) is positive for Dyin < D <
Dimax- R(D) is clearly a nonincreasing function of D since D, < D, implies Pp, <
Py, which in turn implies R(D,) > R(D2). Otherwise, the most important
property of R(D) is its convexity, which we state in the following lemma.
Lemma 7.6.1 For D,,i, < D: < Dmax> Din < D2 < Dmax, and any0<4< 1
R(@D, + (1 — 0)D,) < OR(D,) + (1 — @)R(D2) - (7.6.8)
PRooF Let P, € Ap, and P, € Pp, be such that R(D,) = I(P,) and R(D2) =
I(P). Then since OP, + (1 — 0)P2 € Pop, +(1-)p, USing the convexity of I( - ),
we have
R(@D, + (1 — @)D2) = min I(P)
< 1(6P, + (1 — 0)P,)
< 61(P,) + (1 — O)I(P2)
ee OR(D;) ua (1 sat 0)R(D2)
Thus R(D) is a convex U, continuous, strictly decreasing function of D, for
Dinin < D < Dmax- The strictly decreasing property of R(D) further implies that if
Pe P, yields R(D) = I(P) then D(P) = D; that is, the minimizing conditional
probability that yields R(D) satisfies the constraint with equality and therefore lies
on the boundary of Y,. Figure 7.12 shows a typical rate distortion function.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 433
R(WD)
ROW RD) eA)
D ~ Figure 7.12 A typical rate distortion
function.
R(Dmin) < H(%) = — ), Qu) In Q(u) (7.6.9)
The only conditional probabilities {P,,;,(v|u)} that yield R(D,,;,) = I(Pmin) are
Prrin(v | u) = | : : : i (7.6.10)
where v(u) satisfies
d(u, v(u)) = mi d(u, v)
Here
R(Dinin) = (Prin)
= > Q(u) In Prin(v(u)) (7.6.11)
where
Prrin(v) = 2, Q(U)Pmnin(v |¥)
From this it is clear that, for the condition v(u) ¥ v(u’) for u#u’, we have
Pnin(0(u)) = Q(u) and R(D,,in) = H(%). This condition is typical for most cases of
interest when the number of letters in Y, B, is greater than or equal to the number
of letters in %, A.
We now find necessary and sufficient conditions for the conditional proba-
bility distribution P € Ap that achieves R(D) = I(P). We seek to minimize
P(v|u)
P(v|u')Q(u’)
1(P) = YY Q(u)P(v|u) In , (7.6.12)
434 SOURCE CODING FOR DIGITAL COMMUNICATION
with respect to the AB variables {P(v|u): v e VY, u € Z} subject to the constraints
P(v|u) >0 forallue@Wvewv (7.6.13)
2s) =1 forallue® (7.6.14)
YY Q(u)P(v|u)d(u, v) = D --H(7,6,15)
Without the inequality constraints (7.6.13) this would be a straightforward
Lagrange multiplier minimization problem. We proceed initially as if this were the
case and let {a(u):u e¢ %} and s be Lagrange multipliers for the equality con-
straints (7.6.14) and an 6.15), respectively, and consider the minimization of
JP, 4 s)= iP 2a gel 8.2.2, 08) P(v|u)d(u, v) (7.6.16)
but keeping in mind ultimately the requirement of the inequality constraints
(7.6.13). We find it convenient to define
A(u) = eM for allue W (7.6.17)
so that (7.6.16) can be written as
P(v|u)
s)= d 2 Q(u)P(v | u) In A(u)P(vjea
(7.6.18)
We now assume that d and s are fixed (later we choose them to satisfy the equality
constraints) and we find conditions for the minimization of J(P; 4, s) with respect
to the AB variables {P(v|u)} subject only to the inequality constraints : 6.13).
Since 1(P) and thus J(P; 4, s) are convex U functions of {P {P(v|u)}, a local
minimization of J(P; i, s) is an absolute minimization. We use this in proving the
following theorem.
Theorem 7.6.1 A necessary and sufficient condition for the AB variables
{P*(v|u):v € V,u € UY} to minimize J(P; i, s) of (7.6.18), subject only to the
inequality constraint (7.6.13), is that they satisfy
P*(olu)= Alu)P*(vje"™" — if P*(v) > 0 (7.6.19)
and
> A(ulO(ule.”" << 1 |. if P*(p) 0
where
P*(v) = > P*(viujQ(u) ~— for allve W
PROOF (Sufficiency) Let {P*(v|u)} satisfy conditions (7.6.19). For any € > 0,
taking a variation en(v|u) about P*(v|u) such that
P*(vju)t+ e(vju)>O0 = forallueWMvev (7.6.20)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 435
and defining
nv) => Q(un(v|u) ~— for allue W
we have
AJ(e) = J(P* + «ey; A, s) — J(P*; A, s)
ee ee + anlar)
2 2 Ou)P*(o|u) In Ps [uy(PF(v) + en(o))
Tom ae P*(v|u) + en(v|u)
ir d 2 Q(u)en( | )1 eae _ en(v))e™ v)
| (7.6.21)
The first term in (7.6.21) is?®
— 1+ en(v|u)/P*(v | u)
EE OwPrte|ay in | ane hr
“Sy oupprto|an [meld _ ent)
EL OwPlo pray — peg) £0)
= O(c) (7.6.22)
while the second term is
E Aas hah P*(v|u) + en(v|u)
dd O(u)n( | )1 re % en(v) )A(uee™ 5
P¥(v|u)
= x 2 Q(u)n(v|u) In pee ; + O(e’)
uv: P*(v)>0
eX Otway mn | ela
since for P*(v) > 0 we have
P*(v|u)
P*(v)A(uet™ 7
by (7.6.19), which is true by hypothesis. Hence
A)=€L YF Qwyn(v|u) n an ie | +012) (7.6.24)
uv: P*(v)=0
'? The term O(x) is proportional to x.
436 SOURCE CODING FOR DIGITAL COMMUNICATION
Now using the inequality In x > 1 — (1/x) [see (1.1.6)], we have
J(e)>e 2 : _ 2tunlo |x) [ ua ae foe)
ae Fell -zaeniens]s.o%%
v: oy Fy
> O(€’) (7.6.25)
since by hypothesis for ni == 0)
» O(u)A(ujesa” < 1 and = n(v) > 0
Hence
i J(P* + en; 2, s) — J(P*; A, s)
rane) € €|0 €
>0 (7.6.26)
which assumes a local minimum at P*. By convexity of J(P; 4, s), this must be
an absolute minimum.
(Necessity) Let P* minimize J(P; 4, s) subject to the inequality con-
straint (7.6.13). From above we have for any ¢ > 0 and numbers {P*(v|u)}
such that P*(v|u) + en(v|u) > 0, for allue Mvev
AJ(e) = J(P* + en; 4, s) — J(P*; 2, s)
=e) YY Qu)n(v|u) In
v|u
Peter
uv: P*(v)>0
¢ sdlblei les n(v|u) Ole
+ dA! )n(v|u) In ouwe™ + O(c?) (7.6.27)
First let us choose n(v|u) = 0 for all v where P*(v) = 0. Then
K)=€L YF Oun(v|u) in
u_ iv: P*¥(v) > 0
P*(v|u)
P*(v)A(u)ese™”
| + O(c?) (7.6.28)
where n(v|u) can be any set of positive or negative numbers as long as
P*(v|u) + en(v|u) > 0. Hence, for AJ(c) > 0 for arbitrarily small ¢ > 0, we
require
P*(v|u) = P*(v)A(we™” if P*(v) > 0
Suppose next in (7.6.27) we choose
n(v|u) = : n(v)A(uese Poe (7.6.29)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 437
Then
AJ(e) 7S » >, Q(u)n(v|u) In Is aan 7 O(c?)
u_iv: P*(v)=0
Se Ue toa » O(u')A(u’)ere + 0(2) (7.6.30)
v: P*(v)=0 u’
Since n(v) > 0 when P*(v) = 0, in order for AJ(e) > 0 for all € > 0, we require
2 O O(u)A(ujesa”) <1 if P*(v) =0
To find necessary and sufficient conditions for P* € Y, that yield R(D), in
addition to (7.6.19) we need only choose Lagrange multipliers 4 and s to satisfy
the equality constraints (7.6.14) and (7.6.15). Hence from (7.6.14) and (7.6.19) it
follows that A is given by
rage
A(u) = (> P(v)e™ >| for all ue &@ (7.6.31)
It is more convenient to keep s as a free parameter and express the distortion
D = D, and rate distortion function R(D) = R(D,) in terms of s.
Theorem 7.6.2 Necessary and sufficient conditions for a conditional probabil-
ity {P(v|u)} to yield the rate distortion function R(D) at distortion D are that
the conditions of Theorem 7.6.1 be satisfied, where Lagrange multipliers
satisfy (7.6.31) and s satisfies the sce equations
Do= d d Al A(u (ve 'd(u, v) (7.6.32)
and
R(D,) = sD, + >> Q(u) In A(u) (7.6.33)
PRoor We need only to use P(v|u) = A(u)P(v)e“” in D = D(P) and
R(D) = I(P) to obtain (7.6.32) and (7.6.33).
Although this theorem gives us necessary and sufficient conditions for the
conditional probabilities that yield points on the R(D) curve, actual evaluation
of R(D) is difficult in most cases. Usually we must guess at a conditional prob-
ability and check the above conditions. There are, however, a few relationships
which are helpful in evaluating R(D).
Lemma 7.6.2 The parameter s in (7.6.32) and (7.6.33) is the slope of the rate
distortion function at the point D = D,. That is,
dR(D)
R’'(D,) = es ae =S§ (7.6.34)
438 SOURCE CODING FOR DIGITAL COMMUNICATION
Proor The chain rule yields the relation
dR OR ae OR dau)
| +2 0A(u) dD
Using (7.6.33) we have
Recall that for P(v|u) > 0 we have
P(v|u) = A(u)P(v)e™ ”
Multiplying by Q(u) and spe over u € & gives the relation
2 At A(u)Q(ujesa™ ” = 1
when P(v) > 0. Differentiating with respect to s yields
ei (2(u)atu, v) + ot ote 2H
u
Multiplying by P(v) and summing over v € V ote
7 i oe yest) d(u o) + 2 Olu) i 5 Piejets =
The first term is D, = D and
which yields the relationship
Q(u) da(u)
rs we A(u) ds se
Hence from (7.6.36) it follows that R’(D,) = s.
(7.6.35)
(7.6.36)
(7.6.37)
(7.6.38)
(7.6.39)
(7.6.40)
(7.6.41)
Since R(D) is a decreasing function of D for Di, <_D < Dyax, this lemma
implies that the parameter values of interest satisfy s < 0. We next show that the
slope of R(D) is also continuous in this range.
Lemma 7.6.3 The derivative R’(D) is continuous for Dyin < _D < Dmax-
ProoF Let D,,;, < D* < D,,,, and consider the parameters
s_= lim R'(D)
D; 1.D*
(7.6.42)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 439
and
s, = lim R’(D) (7.6.43)
D | Dt
These are defined since R(D) is a continuous, convex vu function of
Darin < D < Dyrax- We let P,. and P_ be corresponding conditional probabili-
ties. By continuity of R(D) we have
R(D*) = I(P,) = I(P_) (7.6.44)
For any 0<@<1 let P,=O0P,+(1-—90)P_. Certainly P, satisfies
D(P,) = D* so that
R(D*) < I(P5)
< 61(P..) + (1 — O)I(P_) (7.6.45)
The second inequality follows from convexity as proved in Lemma 1A.2. Since
I(P..) = 1(P_) = R(D*), we have
R(D*) < I(P,) < R(D*) (7.6.46)
Thus we must have equality in each of the above steps. On examining the
proof of Lemma 1A.2 in App. 1A for P,(v) > 0, we have
P,(vju) _ P+(v|u) _ P_(v|u)
Pov) Ps(v) — P_(v)
Fa A, (ujes+4™ v)
rm 7 &. (u)es-“™ v)
or
As (u) — glS+—s_)d(u, v) (7.6.47)
Here 4,(u) and A_(u) are the A(u) corresponding to P,(v|u) and P_(v|u),
respectively. Since v does not appear on the left side, either s, =s_ or
d(u, v) = d(u), independent of v. If d(u, v) = d(u), then
P,(vo[u) =A, (uje**P , (v) (7.6.48)
or summing over all » where P,(v) > 0
1 = A,(u)e*+™ (7.6.49)
Hence
P,(vo|u) = P,(v) (7.6.50)
and consequently D* > D,,,, since R(D*) = I(P., ) = 0. But since D < D,,,, we
conclude that s, =s_.
440 SOURCE CODING FOR DIGITAL COMMUNICATION
It has been shown further (Gallager [1968], Berger [1971]) that R’(D) goes to
— oo as D approaches D,,;,, and that the only place a discontinuity of R’(D) can
occur is at D = D,,,,. We next derive another form of R(D) which is useful in
obtaining lower bounds to R(D).
Theorem 7.6.3 The rate distortion function can also be expressed as
R(D)= max sp + ¥ Q(u) In 10) (7.6.51)
where
= tu): YY A(ujQ(ujes”) < | (7.6.52)
Necessary and sufficient conditions for s and 4 to achieve the maximum are
the same as those given in Theorem 7.6.2.
ProoF Let s <0, € A,, and P € Y,. Then using eee have
I(P) — sD — 2 Ou ) In A(u) > 1(P) — sD(P — 2, 2, Olu) P(v|u) In A(u)
ne P(v|u)
= 2 2 Q(u)P(v|u) In Pr)auyer9 (7.6.53)
Again using the inequality In x > 1 — (1/x), we have
P(v)A(ujes™ ”
I(P) — sD ~ ¥ Qu) in Au) = YE OlupPCo|u)ft —
=1 = Ple) Y Awyatujes”
>1-—Y¥ P(v)
=()
Hence for each P € Y, we have
I(P) )=sD + ) Qlu) ) In A(u)
and clearly
I(P)> max |sD+ d Q(u) In A(u) (7.6.54)
s<sO,AEAs5
But from Theorem 7.6.2 we know that there exists a P* € Ap, s* <0, and
X.* € A,, such that
R(D) = I(P*) = s*D + ¥ Q(u) In *(u) (7.6.55)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 441
Hence
R(D)= max |sD + ¥ Q(u) In A(u)
s<0,AeEAS5
We now examine a few examples. It will be clear that even for simple cases,
unless certain symmetries hold, it is difficult to evaluate the rate distortion func-
tion. Fortunately, there is a very useful computational algorithm available for
computing rate distortion functions as well as channel capacities. In App. 7A we
present this algorithm, which is due to Blahut [1972].
Example (Binary source, error distortion) Consider the simple binary source, error distortion
case where @ = ¥ = {0, 1}, O(0) = q <3, O(1) = 1 — g, and d(u, v) = 1 — 6,,. To find R(D), first
we observe that D,,;, = 0 and
Dyrax = Min {qd(0, v) + (1 — g)d(1, v)} =
U
Also for this case we see that
R(0) = #(q) = —qInq—(1—q)In (1 —q) (7.6.56)
We now find R(D) for 0 < D < q. Clearly, if for any P € 7, we have P(0) = 0, then D(P) = q; and
if P(1) = 0 it follows that D(P) = 1 — q > q. Hence for 0 < D < q we must have, for any P € Pp,
the condition P(0) > 0 and P(1) > 0. The conditional probabilities that achieve the rate distor-
tion function must then satisfy
P(v|u) = A(u)P(v)e™”
Multiplying by Q(u) and summing over u € W@ = {0, 1} yields equations
A(0)ge* + ACMA — q)=1
A0)q + A(1)(1 — q)e* = 1
which have solutions
A(O) = —__—~
(0) q(1 + e°)
(7.6.57)
A(1) =
Li = qht+<)
Now we attempt to find P(0) and P(1) of the optimum conditional probability in A,. From
(7.6.31)
4(0) =
1
P(0) + P(1)e*
1
(O)e* + P(1)
(7.6.58)
A(1) = =
which combined with (7.6.57) yield equations
q(1 + e*) = P(O) + P(i)e®
(1 — q)(1 + e*) = P(O)e* + P(1)
442 SOURCE CODING FOR DIGITAL COMMUNICATION
yielding solutions
i-¢-
ae 1-—e*
This then gives the parametric equation for D = D,
D, = YY A(u)Q(u)P(vJe* d(u, v)
= 2(0)Q(0)P(1)e* + A(1)O(1)P(O)e*
. e*
alee
Hence the Lagrange multiplier s must satisfy
(
1 D
Now we have for R(D), using (7.6.33)
R(D) = sD + ¥ Q(u) In A(u)
-Din(—=)+ In (=
eo ee eae
=+H(q)-#(D) O0<D<q
(7.6.59)
(7.6.60)
(7.6.61)
(7.6.62)
Note that since #(q) < #(4) = 1, the rate distortion function for a binary
symmetric source requires the highest rate of any binary source to achieve a given
average distortion D. This is expected since there is greatest uncertainty in the
outputs of the binary symmetric source. The natural generalization of this
example will be examined next. Except for a very special case of this. next example,
the rate distortion function seems too complex to merit detailed presentation here
(see, however, Berger [1971]). Instead we use Theorem 7.6.3 to find a lower bound
to R(D).
Example (Error distortion) Consider the natural generalization of the previous example where
we are given alphabets W = ¥ = {1, 2,..., A}; source probabilities Q(1), Q(2), ..., Q(A); and
distortion measure d(u, v) = 1—6,,. Rather than derive the exact expression for R(D), we
develop an important lower bound to it. Recall from Theorem 7.6.3 that
R(D) > sD + ¥° Q(u) In A(u)
for any s <0 and A(1), A(2), ..., A(A) that satisfy
y Mk)Q(ket*) <1
k=
Suppose we choose A(k)Q(k) to be a constant for k = 1, 2,..., A and require
¥ Akjo(Ke? = 1
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 443
Then
1
A(k)Q(k) = eal (7.6.63)
Now choose
D
s=In laa aia (7.6.64)
and so
A(k)Q(k) =1-—D oe Te eg (7.6.65)
For this choice of s <0 and Ae A,
R(D) > D In
D A 1-—D
2 =e a ON si hen ae
— H(%) — #(D) — D in (A —1) (7.6.66)
Note that for A =2, our previous example, this lower bound gives the exact
expression for R(D). Also for the special case where Q(1) = Q(2)=-:-=
Q(A) = 1/A we can easily check that P(1) = P(2) = --- = P(A) = 1/A, with s and 2
chosen above, satisfy the necessary and sufficient conditions, and again this lower
bound is the exact rate distortion function. It turns out, in fact, that for
0 <D<(A—1) min Q(k)
this lower bound is the exact rate distortion function for the general case. For
(A — 1) min Q(k) < D <D,,,,
where
| Wea = 1 — max Q(k)
k
the exact form of R(D) is more complex and the lower bound is no longer tight
(see Jelinek [1967]).
In the above example, for equally likely source outputs we had a symmetric
condition which rendered easy the determination of the exact rate distortion
function. This case is a special example of a class of sources and distortion meas-
ures referred to as symmetric sources with balanced distortion.
Example (Symmetric source and balanced distortion) Given ¥ = ¥ = {1, 2, ..., A} and equally
likely source probabilities Q(1) = Q(2) =---= Q(A) = 1/A, suppose the distortion matrix
{d(k, j)} has the same set of entries in every row and column. That is, there exist nonnegative
numbers d,, d,,..., d, such that
talk, 3); 9 = 1.2, 5.5 Al = ibe dase Ge - fork = 1, 2,...,A
and
DNK, i ke 3.2... Are id, d,, -...4d3 forj=—12,...,A
444 SOURCE CODING FOR DIGITAL COMMUNICATION
In this case {d(k, j)} is called a balanced distortion matrix, and we now compute the exact rate
distortion function. By symmetry, we guess that P(1) = P(2)=-:: = P(A) =1/A and A(1) =
A(2) = -:: = A(A). We now check to see if the necessary and sufficient conditions of Theorem 7.6.2
are satisfied for this guess. The conditional probability must satisfy
A
P(j|k) = - gD for all j, k (7.6.67)
and from (7.6.31)
A
> ex4k
k=1
This conditional probability satisfies the conditions (7.6.19) with the required 4 value. The rate
distortion function is given in parametric form by (7.6.32) and (7.6.33) which reduces to
A
> ie
k=1
D=D,=— (7.6.69)
ye
k=1
and
A
R(D,) = sD,+ In A—In ( by en (7.6.70)
k=1
The symmetric source with balanced distortion example suggests a general
way of obtaining a lower bound to R(D) for arbitrary discrete memoryless sources.
Lemma 7.6.4 A lower bound to the rate distortion function for a discrete
memoryless source with entropy H(%) is given by
R(D) > R,,(D) = sD + H(%) — In (y edu, -) (7.6.71)
where v* satisfies
Sn Ss BR ee (7.6.72)
u Vev =u
and s satisfies the constraint
1460
u
ey esdtu, v*)
u
D= (7.6.73)
ProoF Choose {A(u): u € %} such that
e
Q(u)A(u) = > elo) (7.6.74)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 445
Then
> esdu, v)
> A(u)Q(ujes? = Xe ot (7.6.75)
and thus 4 € A,. From Theorem 7.6.3 we have, for this choice of 4 € A,
R(D) > sD + H(% -in(y est, 4
for any s <0. We now choose s to maximize this lower bound. By direct
differentiation of the lower bound with respect to s and setting the derivative
to zero, we find that s must satisfy (7.6.73).
Evaluation of R(D) requires finding P € #, that satisfy the necessary and
sufficient conditions given in Theorem 7.6.2. Except for examples with certain
symmetry properties this is difficult. Using the lower bound, which is often tight
for small values of D, is a convenient way to find an approximation to R(D).
Another approach to evaluating R(D) for a specific example is to use the computa-
tional algorithms of App. 7A.
7.7 EVALUATION OF R(D)—CONTINUOUS -AMPLITUDE
MEMORYLESS SOURCES*
The conditions for the evaluation of R(D) for continuous-amplitude memoryless
sources are similar to those for discrete memoryless sources. Recall that the rate
distortion function is defined by (7.5.23), (7.5.18), and (7.5.21) as
R(D)= inf I(P) __ nats/source symbol
Pe Pp
where
and
oo5 Pte |u): DIP)={ | Qlu)P(v|u)d(u, v) do du < D
ome <4 or RD
As with discrete sources, R(D) is a continuous, strictly decreasing function of D for
Dain < D < Dorax, Where here
D,i, = | Spray sata) ah (7.7.1)
= <a
* May be omitted without loss of continuity.
446 SOURCE CODING FOR DIGITAL COMMUNICATION
and
D. - taf | = Ohaus) Wu ‘ees
v a
The strictly decreasing property of R(D) further implies that if P € Ap yields
R(D) = I(P), then D(P) = D.
To find R(D), we want to minimize
y= JF 2twmco|u) i
as) ie
Pe lw | dv du (7.7.3)
subject to the conditions on P(v|u)
P(vju)>0 = foralluuve® (7.7.4)
(o.@)
| P(vlu)dv=1 forallue@ (7.7.5)
| | Q(u)P(v|u)d(u, v) du dv = D (7.7.6)
Using Lagrange multipliers for the equality constraints, (7.7.5) and (7.7.6),
and the calculus of variations, we can obtain the continuous-amplitude form of
Theorem 7.6.1, given in the following theorem (see Berger [1971], chap. 4).
Theorem 7.7.1 Necessary and sufficient conditions for a conditional probabil-
ity P € PY, to yield the rate distortion function R(D) at distortion D are that it
satisfy?°
P(v|u) = A(u)P(v)e™” ~—if P(v) > 0 (CAAA)
[ AwQluje dust — if P(o)=0 (7.78)
where
= | | : Pee” do} (7.7.9)
and where for s < 0, R(D) and D satisfy the parametric equations
Be fe f. A(u)O(u)P(v)e"d(u, v) du do (7.7.10)
R(D)= sD + Q(u) In A(u) du (7.7.11)
oD
Following the same arguments as for the discrete case we have the following
lemmas.
20 In a strict sense, these relations hold for almost all ve VW.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 447
Lemma 7.7.1 The parameter s is the slope of the rate distortion function at
the point D = D,. That is
iN (7.7.12)
Lemma 7.7.2 The derivative R’(D) is continuous for D,,;, < _D < Diax-
Theorem 7.7.2 The rate distortion function can be expressed as
R(D)= sup
s<O,1€EAsS
sD+[ Q(u) In Alu) iu (7.7.13)
where
fe tu: a A(u)Q(ujes*” <1, —w<v< | (7.7.14)
Necessary and sufficient conditions for s and i to realize the maximum are
the same as those given in Theorem 7.7.1.
The main difference between the rate distortion functions for continuous and
discrete sources is that R(D) > oo as D> D,,,,,, since the entropy of a continuous
amplitude source is infinite. For continuous amplitude sources there are only a
few examples of explicit analytical evaluation of the rate distortion function. We
present first the well-known, most commonly used example of a memoryless
Gaussian source with a squared-error distortion measure.
Example (Gaussian source, squared-error distortion) Consider a source that outputs an indepen-
dent Gaussian random variable each symbol time with probability density
1 2 2
Q(u) = —e “le —0o <u<o (7.7.15)
./2no?
and assume a squared-error distortion measure d(u, v) = (u — v)*. For this distortion and source
we have D,,,,, = 0 and D,,,, = o*. We next seek a conditional probability density P ¢ A, which
satisfies the necessary and sufficient conditions of Theorem 7.7.1 for 0 < D < a”. A natural choice
is to choose, for some f?
1
./ 2np?
This then yields the Lagrange multiplier, A( - ), which satisfies
— v2/2p2
P(v) =
e —-0o<v<0 (7.7.16)
[A(u)]-! E> | 8 P(v)es™ ”) dp
uh
ia | P(v)e~ “7/22 dy
re
eee (7.7.17)
5 1
= ./2aa Jala? +
448 SOURCE CODING FOR DIGITAL COMMUNICATION
where a? = — 1/(2s). This choice of P(v) then requires P(v|u) of the form
P(v|u) = A(u)P(v)e™ ”
= A(u)P(v)e~ “~ °/28?
o-Be+ PM) aay
z
oP 292 B?/(a? + 6?)
1
~ /2nfa?p?/(o? + B?)]
All that remains is to satisfy the parametric equations for D and R(D). First,
D= wy O(u)P(v|u)d(u, v) du dv
a? B? "i ( a2 ) ‘
= ——, —_—|o
a2 ae p? a2 + p?
So far a? is directly related to the parameter s, whereas f? is unrestricted. We choose f? to satisfy
a? + B? = o?. This forces the relation on s given by
Dan
a (7.7.19)
as ae
The expression for R(D) then becomes
R(D) = sD + | Q(u) In A(u) d
1 , u?
bisa R GIT: a pani
—-. 2 F owls In 2 wy) du
or
2
R(D) =4 In 5 nats/source symbol 0<D<o? (7.7.20)
The above is the simplest example. We next present without proof other
known examples.
Example (Gaussian source, magnitude error distortion) Consider next the same Gaussian source
with probability density given by (7.7.15) and assume now a distortion measure d(u, v) =
|u — v|. Here D,,;, = 0 and D,,,,, = \/207/n. For 0 < D < D,,,,, the rate distortion function (Tan
and Yao [1975]) is given parametrically by
max?
R(D) = -\(; + a — 20(0)) + reer — In (29(0)) (7.7.21)
pi hin a? ye ge cadad
D,= Bae 29(0))e""7Q(0) + /2/ne*? — 200(0), (7.7.22)
where 0 < 6 < oo. Similar analytical evaluation of rate distortion functions for classes of sources
of probability densities with constrained tail decays under magnitude error distortion are also
given in Tan and Yao [1975]. [In this example only Q(-) is the Gaussian integral function
defined by (2.3.11).]
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 449
Example (Exponential source, magnitude error distortion) Suppose the source probability den-
sity is
Q(u) = petal ~wo<u<o (7.7.23)
with a distortion measure d(u, v) = |u — v|. Then D,;, = Oand D,,,, = 1/«. For 0 < D < 1/a, the
choice (Berger [1971])
P(v) = a2D? 5(v) + 5 — o2D?)e 2" (7.7.24)
yields the rate distortion function
R(D)= —In (aD) _nats/source symbol (7.7.25)
Example (Uniform source, magnitude error distortion) Consider a source with uniform proba-
bility density
ji/(2A) |u| <A
7.7.26
\o |u| > A ( )
Q(u) =
and a distortion measure d(u, v) = |u — v|. Then D,,;, = Oand D,,,, = A/2. For 0 < D < A/2, we
have (Tan and Yao [1975])
R(D) = —In [1 — (1 — 2D/A)"/?] — (1 — 2D/A)!/ (7.7.27)
Finally we note that Rubin [1973] has evaluated the rate distortion function
for the Poisson source under the magnitude error distortion criterion. Evaluations
of rate distortion functions for most other cases are limited to a low range of
distortion values, wherein often a simple lower bound to the rate distortion
function coincides with the actual rate distortion function.
Since R(D) is generally difficult to evaluate, it is natural to consider various
bounds on the rate distortion function. Upper bounds follow easily from the
definition, since
R(D) = inf I(P)
Pep
< I(P)
for any P € Y,. The trick is to choose a convenient form for P € 7). Often, for a
given distortion measure, there is a natural choice for the conditional probability
density that yields a simple, convenient upper bound. For example, for the
squared-error distortion d(u, v) = (u — v)’, a natural choice was to let P € Pp be
the Gaussian density.
Theorem 7.7.3 Let Q( - ) be any source probability density with mean zero
and variance o”. That is, suppose
| uQ(u) du =0 (7.7.28)
450 SOURCE CODING FOR DIGITAL COMMUNICATION
and | u2Q(u) du = 0? (7.7.29)
rs ©
For this source probability density and the squared-error distortion measure,
d(u, v) = (u — v)’, the rate distortion function is bounded by
2
R(D) <4 In = nats/source symbol O<D<o? (7.7.30)
where equality holds if and only if Q(- ) is the Gaussian density.
ProoF For a given D in the interval 0 < D < a’, let
1
P fan —[v—(1 — D/a2)ul2/2D(1 — D/a2) oS ee
ais) J 2nD(1 — D/o?) ( )
Then
00 D D 2
| d(u, v)P(v|u) v= D(1-%)+ ("| u?
and
D(P) = | fe d(u, v)Q(u)P(v|u) dv du
mae,” © — oS
=D (7.7.32)
Hence P € Y, and we have R(D) < I(P). But
iP) = iG i Q(u)P(v|u) In ay du
22 1 OO 00
— | iki, In P(v) dv + py | Ow )P(o|w) In P(v|u) dv du (7.7.33)
ap 1
Letting h(V)= | Pe) In Po” (7.7.34)
be the differential entropy of P(v), and noting that
i eC Q(u)P(v|u) In P(v|u) dv du
ae IE cE Q(u)P(v | “)\- E a en — - In ano( — | ! dv du
= —4-Jln 2nd 1 -5)
—s 0 [anen( 1 — | (7.7.35)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 451
it follows that
R(D) < I(P)
= h(Y’) — 4 In [2meD(1 — D/o’)] (7.7.36)
for the choice of P of (7.7.31). Note from (7.7.28), (7.7.29), and (7.7.31) that
| © PG) do LD (7.7.37)
“i
and ie v?P(v) dv = ie tu) v?P(v|u) av| du
= a Q(u){D(1 — D/o?) + (1 — D/o?)?u?' du
= (1 — D/o’)o’
=o?—D (7.7.38)
It follows also, from Prob. 1.13, that the differential entropy for any proba-
bility density is upper-bounded by the differential entropy of the Gaussian
density having the same mean and variance. Hence, using (7.7.37) and (7.7.38),
we have
h(V) <3 In [2e(c* — D)] (7.7.39)
which in turn yields the desired bound.
Thus, for a given variance, the Gaussian source yields the maximum rate
distortion function with respect to a squared-error criterion. It follows that, for
any given source of variance o? and squared-error fidelity criterion D, there exists
a block code of fixed rate R > 4 In (o7/D) nats per symbol that can achieve
average distortion D. In fact, Sakrison [1975] has shown that for R > 4 In (a7/D),
codes that are designed to achieve average distortion D for the Gaussian source
will be good (in the sense of also achieving distortion D) for any other source with
the same variance. Similar results were obtained for sources with fixed moments
other than the second.
Most of the efforts in evaluating rate distortion functions have concentrated
on deriving lower bounds to R(D). This is due in part to the fact that, for many
sources and distortion measures, a convenient lower bound due to Shannon
[1959] coincides with the actual rate distortion function for some lower range of
values of the fidelity criterion D. To derive lower bounds to R(D), we examine the
form of the rate distortion function given in Theorem 7.7.2. Specifically
R(D)= sup
s<O,AeEAs
sD + & Q(u) In A(u) iu
foe
> sD + c Q(u) In A(u) du (7.7.40)
452 SOURCE CODING FOR DIGITAL COMMUNICATION
for any s <0 and any A € A,, where
A, =
A(u): i: A(ujQ(uje” du<1, -w<v<0
Again we seek a convenient choice of Xe A,. For difference distortion meas-
ures d(u, v) = d(u — v), which depends only on the difference u — v, we have the
following lower bound, R,,(D).
Theorem 7.7.4: Shannon lower bound For a source with probability density
function Q( - ) and difference distortion measure d(u, v) = d(u — v)
h(%) + sD — In | est) i (7.7.41)
a
R(D) = R,,(D) = sup
s< 0
where
fe 0)
h(@) = — | Q(u) In Q(u) du (7.7.42)
™ ©
is the differential entropy of the source.
ProoF Let A(u) be chosen according to
oO
[Aw))' =O) |e az (7.7.43)
Then
ie ia esdtu- v) du
[ A@)Que**-? du = ==
ae | et) dz
aa]
which establishes 4 € A,. Substituting (7.7.43) in (7.7.40) yields the desired
result.
Using direct differentiation with respect to s on the lower bound (7.7.41), we
can easily obtain two special cases.
Corollary 7.7.5: Squared error For d(u, v) = (u — v)* in the above lemma we
have
R(D) > R,,(D) = h(W@) — 4 In (22eD) _ nats/source symbol _ (7.7.44)
Corollary 7.7.6: Magnitude error For d(u, v) = |u — v| in the above lemma
we have
R(D) = R,,(D) = h(@) — In (2eD) __nats/source symbol __ (7.7.45)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 453
In many cases the Shannon lower bound is tight. This occurs when A( - ) given
in (7.7.43) also meets the conditions of Theorem 7.7.1, which are satisfied if and
only if a probability density P,( - ) can be found that satisfies (see Prob. 7.18)
Q(u) = —~ (7.7.46)
for some s > 0. For these values of s, we have R(D,) = R,;,(D,).
For the Gaussian source with squared-error distortion measure, the Shannon
lower bound is tight everywhere; that is, R(D) = R,,(D) for all0 < D < a”. This is
also true of the source with the two-sided exponential density (7.7.23) with a
magnitude error criterion whose R(D) is given by (7.7.25). For a case in which
the lower bound is nowhere tight consider the following.
Example (Gaussian source, magnitude distortion) For the Gaussian source with probability
density
Bi EAs es Ci Acc (7.7.47)
/ 210
and the distortion measure d(u, v) = |u — v|, we have
h(%) = 4 In (27e07) (7.7.48)
and
R,,(D) = 4 In (x0?/2eD7) (7.7.49)
for
no?
0<D< | —
2e
Here in general the true rate distortion function [see the example resulting in (7.7.21) and (7.7.22)]
is strictly greater than the Shannon lower bound. However, by numerical calculations, Tan and
Yao [1975] have shown that the maximum of R(D) — R,,(D) is roughly 0.036 nat per source
symbol, and that at rates above one nat per source symbol the difference is less than one part in a
million. Thus one can conclude that R,,(D) is a very good approximation of R(D) for this
source (see Prob. 7.21).
7.8 BIBLIOGRAPHICAL NOTES AND REFERENCES
The seeds of rate distortion theory can be found in Shannon’s original 1948 paper.
It was another eleven years, however, before Shannon [1959] presented the fun-
damental theorems which serve as the cornerstone of rate distortion theory. In the
late sixties there was a renewed interest in this theory, and at that time the general
information theory texts by Gallager [1968] and Jelinek [1968a] each contained a
chapter devoted to rate distortion theory. The most complete presentation of this
theory can be found in the text by Berger [1971], which is devoted primarily to this
subject.
454 SOURCE CODING FOR DIGITAL COMMUNICATION
In this chapter, the presentation of rate distortion theory is different from
earlier treatments in that we first emphasize the coding theorems and later discuss
the rate distortion function, its properties, and its evaluation. The proofs of the
coding theorems for block codes (Theorem 7.2.1) and for trellis codes (Theorem
7.4.1) are due to the authors (Omura [1973], Viterbi and Omura [1974]). They are
analogous to the proofs of the corresponding channel coding theorems of
Chaps. 3 and 5. The de-emphasis of techniques for the evaluation of R(D) is due
primarily to the fact that there now exists an efficient computational algorithm for
R(D) which is due to Blahut [1972] and is included here in App. 7A.
APPENDIX 7A COMPUTATIONAL ALGORITHM
FOR R(D) (BLAHUT [1972])
The algorithm for computing R(D) is similar to the algorithm for channel
capacity given in App. 3C. Recall that for a discrete memoryless source with
alphabet %, letter probability distribution {Q(u): u € %}, representation alphabet
¥, and distortion {d(u, v): u € %, v € V} the rate distortion function R(D) is given
by (7.2.53)
where
and
The parametric representation for R(D) in terms of parameter s < 0 is given by
(7.6.32) and (7.6.33)
D, = >, Y A(u)Q(u)P(v)e™ ”d(u, v)
R(D,) = sD, + ¥° Q(u) In A(u)
where, by (7.6.31)
=
A(u) = (y P(v)es™ 4 for allue W
The transition probability {P(v|u)} which achieves R(D,) is given by the necessary
and sufficient conditions of (7.6.19)
P(v|u) = A(u)P(v)e@™” ~~ if P(v) > O
and
LAN Sof,
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 455
> Ane" <1 <<, IPia)=0
The algorithm for computing R(D) is based on the following theorem.
Theorem 7A.1 Given parameter s <0. Let {Po(v): v € Y¥} be a probability
vector for which P,(v) > 0 for all v € Y. For integers n = 0, 1, 2, ... define
P,+1(v) = P,(v) )2 Pee (7A.1)
and
edu, v)
Bat i(v|u) 3 ) — v’) (7A.2)
Then, in the limit as n > o0, we have
D(P,,) a D,
TA.
I(P,) > R(D,) Le
where (D,, R(D,)) is the point on the R(D) versus D curve parameterized by s.
ProoFr Consider the JD plane shown in Fig. 7A.1. Define V(P) = I(P) —
sD(P) which can be interpreted as the J-axis intercept of a line of slope s which
passes through the point (/(P), D(P)). Recall that the point on the R(D)-
versus-D curve parameterized by s has a tangent that is parallel to every such
line of slope s, and this point lies beneath all such lines since R(D) is defined as
a minimization over I(P). We show that V(P,) is strictly decreasing with n,
unless (J(P,,), D(P,,)) is a point on the R(D)-versus-D curve.
R(D)
oS ae D Figure 7A.1 Sketch of ID plane.
456 SOURCE CODING FOR DIGITAL COMMUNICATION
From (7A.2) we have
V(Pi+1) = 1(Pi+1) — sD(P,+1)
| P+ 1(v|u)
=>) ¥ Ou)P,+1(v|u) In Peet —s> ¥ Q(u)P,+ (v|u)d(u, v)
. P,(v)es™ v)
bs iE Pte) yes, “
os os > QO(u)P n+ i(v|u) In esdtu, v)
= a are ( Ae d Q(u) In (x Pine ») (7A.4)
v n+ v
From this we get the difference
V(P,4+1) — a: Py, ke
PB due v)
a Patolu)(3, Paleo}
P (ve v)
P,(v|u) (> Pate) jeu)
= (7A.5)
—1
where again we used In x < x — 1. We have equality in (7A.5) if and only if
fat 1(v) a P,(v) and
P ee v)
dP, alt v’)
v’
P,(v|u) =
(74.6)
which is one of the conditions (7.6.19) for the distribution that achieves R(D,).
Since V(P,,) is nonincreasing and is bounded below by R(D) — sD, it must
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 457
converge to some value V(P,,) as n— oo. The sequence P,, must have a limit
point P*, and by continuity of V(P) this limit point must satisfy (7A.1)
edu, v)
rt ”)d Prt edu, v’) (7A.7)
Thus P* satisfies necessary and sufficient conditions to achieve R(D,) so that
V(P*) = R(D,) — sD
s*
The accuracy of the computational algorithm after a finite number of steps is
given by the following theorem.
Theorem 7A.2 Given any probability distribution {P(v): v € W} let
Cay eS
Ly Pye) for allvev (7A.8)
Then for {P(v|u): ue UW, v € VY} satisfying
P(v)es™ v)
P(v | u) iis > P(v' jes v’) (7A.9)
we have at the point
D=DIP)= EE Ou) eran
the bounds :
— max In C(v) < R(D) — sD + ¥° Q(u) In » P(v)e“™ ‘
< —Y P(v)C(v) In C(v) (7A.10)
Proor If D(P) = D then P € Y, and
R(D) < I(P) :
= EE Owl) in| soe, ee a]
y y ol \P( : ) z P(vje sd(u, v)
i (E Peeve”) oP C|W)]
= sD —¥ Qu) In (> P(v)est 4
pa Q v|u
— > > QAu)P(v|u) In | = Pv) (7A.11)
u v
458 SOURCE CODING FOR DIGITAL COMMUNICATION
But
¥. Olu)P(v|u) = P(r)C(v) (7A.12)
so that
R(D) < sD — 9° Q(u) In p P(v)es™™ °| — ¥) P(v)C(v) In C(v) (7A.13)
u v v
From Theorem 7.6.3 we have
R(D) > sD + ¥° Q(u) In A(u) (7A.14)
where A is any vector such that
Y A(wQ(ujes"”" <1 = forallvev (7A.15)
Let us choose
Ay = ee > Pye ; iu for allue & (7A.16)
where
Cmax = Max C{v)
v
Then (7A.15) is satisfied and
R(D) => sD — ¥. Q(u) In » P(v)e*™” | — max In C(v) (7.17)
v
We see that, for {P(v): v €e W} that achieves the point R(D), we have
R(D) = sD — ¥. Q(u) In IY P(v)est "4 (7A.18)
and
C(v) <1 (7A.19)
with equality when P(v) > 0. Thus
—max In C(v) = —)}) P(v)C(v) In C(v)
v
=0 (7A.20)
and the bounds in (7A.10) are tight. The two theorems suggest the following
algorithm for a given ¢ > 0 level of desired accuracy.
Step 1: Set n = 1 and pick an initial probability Py . (The uniform distribution will
do.)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 459
Step 2: Compute?’ for the given Q(u), d(u, v), and for any s <0
C,(v) = > ere (7A.21)
Pas s(v) = Palv)C,(0) (74.22)
A, = —Y, P,(v)C,(v) In C,(v) (74.23)
B, = —max In C,(v) (7A.24)
Step 3: If A, — B, < 6 compute D(P,)
R(D(P,,)) = sD(P,,) — ), Q(u) In
u
Sy P,(v)er " +A, (7A.25)
v
and stop.
Step 4: If A, — B, > ¢, change n to n + 1 and go to step 2.
PROBLEMS
7.1 Prove inequality (7.2.47), using a proof similar to the proof of Lemma 1.2.2 for 1(2,; Yy) given in
Chap. 1.
7.2 For sequences of length N define
Pp. w = {Py(v|u): 3 z Q,(u)Px(v |u)dy(u, v) < D}
and
show that
R(D) = R,(D) | Bee aa
7.3 (Gallager [1968]) For Theorem 7.2.4 show that
—E"(p; P) < C, = 2 + 16[In A}?
for —5 <p <0. It is convenient to define
a(u) = Y P(o)Q(u|o) +”
P(o}Q(u|v)r*”
a(u)
B(v|u) =
7! The choice of s = 0 yields R(D,,,,) = 0, where D,,,, is given by (7.6.7). The choice of s < 0 yields
the point (R(D), D), where the slope is s.
460 SOURCE CODING FOR DIGITAL COMMUNICATION
and show that
—E,(p, P) = in Y a(u)'*”
—E{(p, P) = d c(u) d B(v|u) In Olu ss
= @M\u viu P(e)
= Zot ale) nse
and
Bile, P) < Fo) ¥ plelay|in oo
7.4 Show that the source coding theorem given by Corollary 7.2.2 remains true when either D + ¢ is
replaced by D (provided D > D,,;,) or R(D) + € is replaced by R(D).
7.5 Prove Theorem 7.3.2 using the proof given in the converse source coding theorem (Theorem 7.2.3)
and the data processing theorem (Theorem 1.2.1).
7.6 A source and channel are said to be matched to each other when the channel transition probabili-
ties satisfy the conditions for achieving R(D) of the source, and the source letter probabilities drive the
channel at capacity. Here R(D) = C where the time per source output is equal to the time per channel
use. Show that when a source and channel are matched there is no need for any source and channel
encoding to achieve ideal performance. Examine the equiprobable binary source with error distortion
at fidelity D and the binary symmetric channel with crossover probability « where « = D.
7.7 Consider the source encoder and decoder of Fig. 7.3. If fidelity D can be achieved with a code of
rate R > R(D), show that the entropy of the encoder output
1
r
1 M
where P,,, is the probability of index m € {1, 2, ..., M}, satisfies
R(D) < H(¥4)<R
7.8 For an arbitrary DMC, show that the expurgated exponent E,,(R) satisfies
E,,(R) =D
where R = R(D) is given by (7.3.17). Also find the necessary and sufficient conditions for equality.
Hint: Examine Prob. 3.21 and show that
R(D,) > R,(D,) = sD, — In y(s, q)
7.9 For a DMC, define the Bhattacharyya distance given by (7.3.16) and the natural rate distortion
function given by (7.3.17). Then prove a generalized Gilbert bound for this distance measure, analo-
gous to Theorem 7.3.3.
7.10 (Analysis of Tree Codes, Jelinek [1969]) Suppose we have a binary symmetric source with the error
distortion measure. From (7.2.42), the rate distortion function is given by R(D) = In 2 — #(D) nats
per source symbol where 0 < D < 4. We now consider encoding this source with a binary tree code of
rate R = (In 2)/n nats per source symbol where we assume R > R(D). This tree code has n binary
representation symbols on each branch. Let T, be such a tree code that is terminated at / branches.
Then for source sequence ue W,,, we define d(u|7,) as the minimum normalized error distortion
between u and paths in the terminated tree T,. A larger terminated tree T, where L = ml can be
constructed from many terminated trees of length | by attaching the base nodes of these trees to
terminal nodes of other trees. We now consider an ensemble of terminated tree codes where all branch
binary representation symbols are independent and equally likely to be “0” or “1”.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 461
(a) Taking the expectation over the tree code ensemble for any source output sequence u, show
that
D* = lim E{d(u| T,)}
l- a
= lim E{d(u| T,,,)}
m- oO
exists and is independent of the source output sequence.
(b) Next, given any 6>0 and output sequence ue %,,,,, define u = (u,, u,, ..., u,,) where
u; € %,, for each i = 1, 2, ..., m and variables zy, z,, Z,, Z,, associated with code tree T,,, as follows:
Zo = 1
z, = number of paths with distortion D + 6 or less from u, over the first / branches
z, = number of paths extending the z, paths above with distortion D + 6 or less from u, over the
second / branches
Zp, = number of paths extending the z,,_ , paths above with distortion D + 6 or less from u,, over the
mth | branches
It follows from the branching process extinction theorem (Feller [1957]) that, over the tree code
ensemble, Pr{z,, > 0} decreases monotonically with m and approaches a strictly positive limiting value,
lim Pr{z,, >0}>0
provided E{z,} > 1. Using Chernoff bounds (see App. 8A), show that
E{z1} = 2' Pr{d,(u, v) < D + d|u}
> 21 — 1/nld2)e~ RWD)— RD)
= (1 — 1/152) e™R- RWD)+ RDI)
Hence for small 6 such that R > R(D) — R’(D)6, we can find | large enough to have E{z,} > 1.
(c) Assuming E{z,} > 1 use the branching process extinction theorem in (b) to prove
lim Pr {d(u| T,,,) < D + 6} >0
m-> oO
(d) From the definition of D* given in (a), we can choose / large enough so that
D* — 6 < E{d(u|T,)} < D* + 6
For such a choice of | show that
lim Pr {d(u| T,,,) < D* — 5} =0
Hint: For u = (u,, u,, ..., u,,), note that
1 m
d(u|T,u) = — > 4,(u;, v;)
mM j=1
where v = (v,, V5, ..., ¥,,) is the minimum distortion path sequence in T,,,. But for i = 1, 2, ..., m
d,u;, ¥;) > d(u;| T})
where 7; is the subtree of T,,, in which v; belongs. Thus
Pr {d(u| T,,,) < D* — 5} < Pr - Yalu T;) < D* - |
462 SOURCE CODING FOR DIGITAL COMMUNICATION
(e) Note that we have from (c)
lim Pr {d(u| T,,,) < D + 5} > 0
and from (d)
lim Pr {d(u| T,,,) < D* — 5} =0
From this show that, for any € > 0, there exists a binary tree code of rate R = (In 2)/n > R(D) such that
the average distortion D* satisfies
D*<D+e
Here the average is taken over all source output sequences.
7.11 Consider the same source coding situation presented in Prob. 7.10 where we consider binary tree
codes of rate R = (In 2)/n nats per source symbol. For fixed source sequence u € W,,, we define over
the tree code ensemble the probabilities
t te6 L.A
G(t|1) = Pr la T,)>— owes
(t |!) s | T,) i Pree
where
0 t £0
Git 0) = |
(t |0) Figgeernes*
Show that
G(t|/+ 1)= 23 a Jara — ejo]
Numerical solutions to G(t|1) show that for this symmetric source, tree codes also exhibit the doubly
exponential behavior observed for block source coding of symmetric sources with balanced distor-
tions (see Chap. 8).
7.12 Show that for continuous-amplitude memoryless sources if the condition on the distortion given
by (7.5.1) is replaced by
‘ Q(u)d*(u, 0) du < dj for a> 1
then the source coding theorem (Theorem 7.5.1) has the form for (7.5.24) given by
d(B) < D + do e7 a~ WalNELR, D)
7.13 Show that convexity of R(D) implies that R(D) is a continuous strictly decreasing function of D
for Din < D < Dax. Show that it further implies that if P € A, yields R(D) = I(P), then D(P) = D.
7.14 Let R(D) be the rate distortion function for a discrete memoryless source with distortion measure
{d(u, v): ue &, v € ¥}. Now consider another distortion measure defined as
{d(u, v) = d(u, v) — min d(u, v): ue Wve Vv}
vev
and let R(D) be the corresponding rate distortion function. Show that
R(D) = R(D + D
ad
where
= d Q(u) min d(u, v)
vev
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 463
7.15 Consider a source alphabet Y% = {0, 1} with probability Q(0) = Q(1) = 4. Let the representation
alphabet be ¥ = {0, 1, 2} and the distortion defined as
0 ¢=0
d(0, v) = 1 v=1
a =3i2
Ee v=0
ee v=1
a g=2
where « < 4. Sketch R(D) for this case. For « > 4, show that R(D) = In 2 — #(D).
7.16 Suppose ¥@ = ¥ and d(u, v) = 1 — 6,,. For
0<D<(A-—1) min Q(u)
show that
R(D) = H(%) — #(D) — D In (A — 1)
7.17 (Gallager [1968]) Consider a discrete memoryless source with four equiprobable outputs from
U = {1, 2, 3, 4}. Let VY = {1, 2, 3, 4, 5, 6, 7}, and distortion be given by
0 g= 0
1 g=lo¢g2and o= 5
d(u, v) =< 1 u=3or4andv=6
3 o=7
00 otherwise
Show that the rate distortion function is given as shown:
R(D)
In 2
| | D
1 2 3 Figure P7.17
Note: With infinite distortion measure the source coding theorem still holds if there is a v* € W
such that }°, Q(u) d(u, v*) < oo (v* = 7 in this example). (For further results concerning infinite distor-
tion measures, see Gallager [1968]. Also note the discussion following Theorem 7.5.2.)
7.18 For parameter s < 0, show that if a probability density {P,(v): — 00 < v < o} satisfies (7.7.46),
then Shannon’s lower bound is tight. That is, R(D,) = R,,(D,).
7.19 For memoryless sources the Shannon lower bound, R, ,(D), is given by Theorem 7.7.4.
(a) Show that the maximizing value of s < 0 in the definition of R, ,(D) satisfies
| d(z)G,(z) dz =D
where
es4(z)
G2) = Fae) da
464 SOURCE CODING FOR DIGITAL COMMUNICATION
(b) Next let Y, be the set of all probability densities for which
| d(z)G(z) dz < D
and use variational calculus to show that
R,,(D) = h(%) — max h(G)
Ge¥Fp
7.20 (Berger [1971].) In Prob. 7.19, let d(u, v)= |u — v].
(a) Show that
(b) For R(D) = R,,(D), {P(v): —0 <v < co} must satisfy (7.7.46) (see Prob. 7.18). Show that
then P(v) must satisfy
P(v) = Q(v) — D?Q”(v) —0<v<©0
(c) Apply (b) to a source with
a
ot ca —0 <u<0oo
and show that
RD) Bs dh Bape.
0
(d) Apply (b) to a source with
2
Qu)=—(1+u*)? —-0o<u<@
and show that
4n 1
R(D) = R,,(D) = In |— ] - 3 0<D<—=-
(D) = R,,(D) = n (7) =D a
7.21 (Berger [1971].) For a memoryless Gaussian source and a difference distortion measure
other than d(u, v) = (u — v)*, show that the Shannon lower bound is never exact. That is, R(D) >
R,,(D) for all D. Note the example of d(u, v) = |u — v| in Sec. 7.7.
Hint: Use Cramer’s theorem.
7.22 (Linkov [1965] and Pinkston [1966].) For a memoryless Gaussian source, find R,,(D) for
d(u, v) = |u — v|*. Check your results by specializing to « = 1 and « = 2.
7.23 Generalize the lower bound in Lemma 7.6.4 to continuous-amplitude memoryless sources. Then
show that this becomes the Shannon lower bound when d(u, v) is a difference distortion measure.
7.24 Using the calculus of variations (see Courant and Hilbert [1953], chap. 4), prove Theorem 7.7.1.
7.25 (Countably Infinite Size Alphabet) Discrete memoryless sources with a countably infinite size
alphabet and unbound distortion measures have coding theorems that are given by Theorem 7.5.1 and
Theorem 7.5.3, where R(D) is given by (7.5.23) with integrals replaced by summations.
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 465
(a) For a discrete memoryless source with a countably infinite alphabet of integers .% = {0, 1,
2,...} and the magnitude distortion measure, d(u, v) = |u —v|, show that R(D) > R,,(D) where
R,,(D) is given parametrically by
R,,(D,) = H(@) + sD, — In (1 + e*) + (1 — Q(0)) In (1 — e*)
and
e e
D,= Pease ee 20) >
for s < 0.
Hint: Choose A(u) as follows (See Tan and Yao [1975].):
10)= g@\li+e)
ease goles
(b) Show that necessary and sufficient conditions for R(D) = R,,(D) in (a) is that there exists a
probability distribution {P(v)} that satisfies
(1+e)0(0) u=0
Bre ie
7 eo |
7.26 In Prob. 7.25, let the source have a Poisson distribution
A
Ou)=—e’ a> 60=0, 42...
u!
Show that R(D) > R,,(D), for all 0 < D < D,,,,
7.27 In Prob. 7.25, let the source have a geometric distribution
Q(u) = (1 — 6)" 6<8<Aw- 61,2...
For this case show that R(D) = R,,(D) for all 0 < D < D, where
eX eX
and s, is given by
bea if 0* — 403 + 402 + 40-4 <0
eee He oF +1" ka eee
22 — 8) \| 22-0) | ~2~0I otherwise.
(See Tan and Yao [1975].)
7.28 (Shannon’s First Theorem Revisited, Gallager [1976].) Consider a DMS with alphabet WY and
probability distribution Q(u), ue W@. We encode each sequence ue Wy of length N by an index
f(u) = me {1, 2, .... M} where M = e®". The index m is sent over a noiseless channel to a source
decoder that estimates the sequence by
466 SOURCE CODING FOR DIGITAL COMMUNICATION
That is, i is chosen among all i which satisfies f(t) = m and maximizes Q (it), the probability of the
sequence i. We want to show that there exist encoders [described by f(u) = m] such that
lim Pr {a 4 u} = 0
N>o@
as long as R > H(%), the entropy of the source. This system is shown below.
Sk tea a 71
ueUy Source | Source A
DMS mi Noiseless’ «| 1 decoder “y
H (wu) m patencats ee channel | a= max! Oy (u) <pkeaNis
on £ oo ne f(u)=m
me {1,2,...,M},M=eRN
Figure P7.28
(a) Define the functions
Ae f(u) =f (a
Mea ig pw esa)
| Q,(t) > Qy(u)
O(u, = a
HQ)= 9 OM) < Qy(u)
Show that for anyO <p <1
Pr {a # u} < )) Qv(u)} DY (u, a] f)O(u, a \a)|"
atu
(b) Next, show that for any0 <p <1
Pr {a #us<) on(uy'*| Y vu, ti niu r |
(c) Randomly choose a source encoder function f such that over the ensemble of encoder
functions for any u + i and any m, m we have
Pr {f(u) = m, f (a) = m|u, a} = Pr {f(u) = m|u} Pr (f(a) = m|a}
1
~ M2?
Averaging Pr {t + u} over the ensemble of encoders, show that
1+)
Pr wea) <m-*/5 n(x 9< pc!
(d) Prove the source coding theorem from above by showing that for any N there exists an
encoder and decoder such that
Pruastg<ce ""
where E,(R) > 0 for R > H(%).
7.29 (Slepian and Wolf Extension to Side Information.) Suppose in Prob. 7.28 only the source decoder
has additional side information v € Y, such that when the source sequence is u € %, then the source
decoder receives index m from the channel as well as v. Here u, v have joint distribution
Qy(u, v) = IT Ou,» Pn)
RATE DISTORTION THEORY: FUNDAMENTAL CONCEPTS FOR MEMORYLESS SOURCES 467
The source decoder chooses t where
Prove the generalization of Prob. 7.28 for this side information at the decoder situation, by showing
that for any N there exist encoders and decoders such that
Pr {a $+ u} < e NEAR)
where E,(R) > 0 for R > H(%|V7).
7.30 (Joint Source and Channel Coding Theorem, Gallager [1968].)
(a) Let p,(v|x) be the transition probability assignment for sequences of length N on a discrete
channel, and consider an ensemble of codes, in which M codewords are independently chosen, each
with a probability assignment g,(x). Let the messages encoded into these codewords have a probability
assignment Q,,, 1 <_m < M, and consider a maximum a posteriori probability decoder, which, given
y, chooses the m that maximizes Q,, py(y|x,,)- Let
P=) 0..P..m
be the average error probability over this ensemble of messages and codes, and by modifying the proof
of (3.1.14) where necessary, show that
sa M l+p l+p
P< | >. gut+el 2» y ay(x)py(y |x)" 4
m=1 y x
(b) Let the channel be memoryless with transition probabilities p(y|x), let the letters of the code-
words be independently chosen with probability assignment q(x), and let the messages be sequences of
length L from a discrete memoryless source W@ with probability assignment Q(i), 1 <i < A. Show that
p —NE,(p, q) + LE,(p)
| iE
Ep) = (1 + p) In z aun]
(c) Show that E,(0) = 0, that
0E(p)
op
= H(%)
p=0
and that E,(p) is strictly increasing in p [if no Q(i) = 1].
(d) Let A = L/N, and let N > o with 2 fixed. Show that P, +0 if AH(W) < C where C is the
channel capacity.
CHAPTER
EIGHT
RATE DISTORTION THEORY:
MEMORY, GAUSSIAN SOURCES,
AND UNIVERSAL CODING
8.1 MEMORYLESS VECTOR SOURCES
Chapter 7 presented the rate distortion theory for memoryless sources thet emit
discrete or continuous random variables. For these sources, an output occurs once
every 7, seconds, and the sequence of outputs are independent random variables
with identical probability distributions. We can extend these results to mem-
oryless sources with outputs that belong to more abstract alphabets. For example,
the output of the memoryless source may be a random vector, a continuous-time
random process, or a random field. By generalizing in this manner, we can extend
the theory to more general sources with memory.
Consider a memoryless source that outputs every 7, seconds a random vector
of dimension L denoted by x. Here
eee deg) 6. ce, jae) (8.1.1)
where!
uu 6 & = {a,, ds, ~.. Og} oy fe een ?
‘ We can equally allow each component to belong to a different alphabet. Although x is a vector,
we regard it as a letter from some abstract alphabet 7 = W,.
468
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 469
Denote the alphabet for all such vectors by 2 = W, and assume that the proba-
bility distribution for x € ¥ is given by Q(x) where x € 2. Note that the com-
ponents of x are not necessarily independent. We represent the L-dimensional
vector source outputs by vectors
y = (v®, o, ..., o) (8.1.2)
belonging to the alphabet Y¥ = ¥, where VY = {b,, b,, ..., bg}. Throughout this
discussion, assume that for each source-user pair of vectors xe W, and ye ¥ _,,
we have a bounded distortion defined by the set of L distortion measures
d(u, v) < df < « for allue Wve Vv, 1=1,2,...,L (8.1.3)
The memoryless source that emits a vector of dimension L every T, seconds
can be viewed as L memoryless sources with outputs every T, seconds that are not
necessarily independent of each other. This description is shown in Fig. 8.1, where
we assume that only one noiseless channel is available. From this viewpoint, we
have L users who seek an estimate of the corresponding L source outputs, and
each source-user pair has a distortion measure given by (8.1.3).
Although a single-letter distortion measure is given for each source—user com-
ponent pair, there is no overall fidelity criterion for evaluating or designing a
source encoder-—decoder system. A vector distortion measure consisting of the L
single-letter distortion measures, for example, is inadequate because two systems
yielding two average vector distortions generally cannot be compared, since vec-
tors, unlike real numbers, cannot be completely ordered. Therefore, we require
some overall real-valued distortion measure to proceed further in our analysis.
Next, we consider two such distortion measures.
yD (1)
Source 5 ene Ge
I Source Source I
uw) (2)
Source} | . . =| -e----- 4 Use
5 > Encoder ¥ Decoder > ) :
| : |
Noiseless
channel ag
|
ae Bae 4
Ey (L)
Source u a Vv" _| User
3 ae re L
Figure 8.1 Multiple source—user system.
470 SOURCE CODING FOR DIGITAL COMMUNICATION
8.1.1 Sum Distortion Measure
A natural choice for a single real-valued distortion measure is the sum distortion
measure between x € Z and ye Y defined by
io
v(x, y) = du, v®) (8.1.4)
l=1
where
ett, ae pga)
y an Fo sol, 20), 0)
For sequences of N successive terms x € 2y and y € Wy the obvious generalization
1s
(8.1.5)
1 N
Y X, 7 et ae Xn» n
nl% Y) =~ 2 1 Yn)
L
ie ¥ dO(u, vy) (8.1.6)
l=1
where
1 N
d?(u, vy) a N > d(u, vo) E mee, bo ody ida ab (8.1.7)
n=1
Since d(u, v) < di) < oo for all | we have
L
(x, y)< Yo = pe ge 00
1=1
Hence, for the sum distortion measure defined above, we have reduced the prob-
lem to a single discrete memoryless source with alphabet %, probability Q( - ),
representation alphabet ¥Y, and a bounded single-letter distortion measure (x, y).
The coding theorems of Sec. 7.2 apply directly and the rate distortion function is
given by [see (7.2.53)]
R(D) = min I(P) (8.1.8)
where
y= {Ply|x): DE OIC |xW76s ») < D| (8.1.9)
From Theorem 7.6.1, necessary and sufficient conditions for Pe Pp to
achieve R(D) = I(P) are given by
Ply |x) = a(x) P(y)e"™” if P(y) >0 (8.1.10)
and
SY A(x)O(x)er" <1 if P(y) =0 (8.1.11)
x
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 471
where
-1
= (¥ P(y)e"™ “ xex (8.1.12)
y
and s <0 satisfies the parametric re for R(D)
D = 2 dA A(x)Q(x)P(y)e?’™ ”y(x, y) (8.1.13)
and
R(D) = sD + ¥° Q(x) In A(x) (8.1.14)
Note that the components of each x are not necessarily independent, although the
successive vectors x,, X2, ..., Xy are mutually independent. A special case of
interest is when the L components of the source output vectors are independent of
each other, so that in the description of Fig. 8.1, we have L independent mem-
oryless sources.
Lemma 8.1.1: Independent components—sum distortion measure For in-
dependent components and the sum distortion measure, the rate distortion
function is given parametrically by
L
Dez >, D® (8.1.15)
l=1
and
L
R(D,) = pa R©(D®) (8.1.16)
l=1
where R‘(D) is the rate distortion function for the /th component with the
Ith distortion measure, and is given parametrically by the same parameter
50 for ell be A, 2) i230.,-L.
ProoF Let {Q(u): u € Z} be the Ith source output component probability
distribution. Recall that the distortion measure for this component is given by
d(u, v). (We can regard each component as an output of some discrete
memoryless source.) Suppose the conditional probability P(v|u) achieves
the rate distortion function of the /th component sequence for parameter
s <0 and thus satisfies
P(v|u) = AP(u)PO(v)es” if P(v) > 0 (8.1.17)
and
d Au) Q(ujesPM”<1 if Pv) =0 (8.1.18)
where
1
A(u y= 5 Pm POY(v)edu le ue WU (8.1.19)
472 SOURCE CODING FOR DIGITAL COMMUNICATION
and that it also satisfies the parametric equations
DO = — d 2 AO(u yo” (u)P(v Je sdO(u, dO (u, v) (8. 1.20)
and
R(D®) = sD® + 2 Q(u) In A(u) (8.1.21)
Since the sources are independent, we have
L
a eau (8.1.22)
Defining
P(y|x) = rl P(v® | u) (8.1.23)
and .
A(x) = Il A(u) (8.1.24)
we see that this choice of {P(y|x): ye Y, x € #} and {A(x): x € %} satisfy the
necessary and sufficient conditions of (8.1.10) to (8.1.14), giving the desired
result.
One expects that when the L source output components are not independent,
then the rate distortion function is upper-bounded by the corresponding rate
distortion function when we assume the components are independent. We show
this next.
Theorem 8.1.1 For the sum distortion measure, the rate distortion function
R(D) is bounded by R(D), the rate distortion function obtained if the source
output components are independent with the same marginal probability dis-
tributions. That is,
R(D) < R(D) (8.1.25)
where R(D) is given by (8.1.15) and (8.1.16).
PRooF Recall that for any Pe A,
R(D) < I(P) (8.1.26)
But from Lemma 1.2.1
< YY O(x)P(y |x) In P(y|x) (8.1.27)
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 473
for any probability distribution P(y). Choose
#
P(y) = I] PO(v) (8.1.28)
l=1
and
E
P(y | x) = I] PO(v® |u®) (8.1.29)
l=1
where {P( - )} and {P( - | - )} correspond to P € Apw that achieves R(D)
a eS oe ae See Gees SS
LPO | y)
I(P)<> , Ox )P(y|x) In Il PO(p)
PY(v | u)
» 2 Q(u)PO(v|u) In “P%)
|
Ip4n *
Mr Mr
R(D) (8.1.30)
These theorems also hold for continuous-amplitude random vectors under
the bounded variance condition on the distortion measures of each of the L
source-user component pairs. The memoryless source with vector outputs,
together with the above sum distortion measure, is a very useful model in under-
standing the problem of encoding sources with memory. We will return to the
above results when we discuss both discrete-time and continuous-time sources
with memory.
Example (Gaussian vector sources, squared error distortion) Suppose we have a memoryless
source that emits every T, seconds a vector with L independent zero-mean Gaussian components
where
@
[ wQ%u)du=o? 1=1,2,...,L (8.1.31)
“—@
Also let d(u, v) = (u — v)? for 1= 1, 2,..., L. For the Ith source-user component pair (see
Sec. 7.7), we have the rate distortion function
ap [ton ODP sol
(DS) = D; (8.1.32)
0 o? < D®
with slope (Lemma 7.7.1)
: det O< DP =
Oyo 2D” og
RO(D) x s (8.1.33)
~ dD®
474 SOURCE CODING FOR DIGITAL COMMUNICATION
for | = 1, 2,..., L. Hence for common parameter s, for each component
1 S 1
2s ses GO;
D = (8.1.34)
: <s<0
o —-~—><s
ec
or
D” = min (6, a?) (8.1.35)
where
C= >0 (8.1.36)
ties. a: ed
For the sum distortion measure, the rate distortion function is thus given in terms of parameter
0 as 2
D,= }, min (0, of) (8.1.37)
l=1
and
L a2
R(D,) = > max (0.4 In +) (8.1.38)
l=1
For small distortions D < min {o7, 03, ..., 07}, this becomes
L o2
R(D) =41n (Ii “| (8.1.39)
resis
8.1.2 Maximum Distortion Measure
For the memoryless source with vector outputs, where there is a set of L source-—
user component distortion measures (8.1.3), another natural choice for a single
real-valued distortion measure for sequences of length N is to define the distortion
between x € #y and ye YW, as
Yw(X, y) = max {dy(u", v°) — 6}
l
fb ww |
max j— >) du, vo?) — 91 (8.1.40)
l n=1
where 6,, 0,,..., 0, is a set of nonnegative real numbers. Recall that for each
Red 2 re
Xe (UP a)
Sea i} (8.1.41)
i he ee ed
This distortion measure is essentially the maximum of the L distortions of the
source-user pairs. The bias parameters {6} allow for control of the amount of
distortion of each source-user pair. Since we attempt to minimize distortion
n(x, y), this can be viewed as a minimax approach. Note that although the source
is memoryless, the above distortion is no longer a single letter distortion measure.
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 475
That is
1u(X 9) FY Yala In (8.1.42)
n=1
Hence it appears that the coding theorem of Sec. 7.2 will no longer apply.
However, with only slight modifications, the earlier coding-theorem proofs can be
applied to this maximum distortion measure.
For a code & = {y;, y2,.--, Yu} of block Jength N and rate R = (In M)/N nats
per source symbol, the average distortion is
WB) = ¥ On(x)y(x |)
= )) Qn(x) min yy(x, y) (8.1.43)
yeB
Now for any conditional probability distribution {P(y|x): y € Y, x € 2}, consider
a code ensemble where all codeword components are selected independently
according to {P(y): y € Y} where
P(y) =), Ox)P(y|x) yey (8.1.44)
Then following Sec. 7.2 leading to (7.2.30), we find that the code-ensemble average
of »(Z) is bounded by
WB) < yx(P) + yoe ERP) (8.1.45)
where
yn(P) = d d Ox(x)P yxy |x)yn(x, y)
= YY Ov(x)Pa(y |x) max {d(w?, v)— 0} (8.1.46)
and where one |
Yo = ya (8.1.47)
E(R, P) = ga E(R; p, P) (8.1.48)
E(R; p, P) = oo E,(p, P) (8.1.49)
Ep. P)= —nD|Y PUjoge|yyarm] (8.1.50)
and 2
Q(x|y) = Pl Lx)0C) (8.1.51)
P(y)
476 SOURCE CODING FOR DIGITAL COMMUNICATION
The only change from the form in (7.2.30) is the term y,(P) which is bounded
further by defining sets
A, = {(x, y): dO (u, v) — 0, > 0} te Be eee & (8.1.52)
and the union
i. | |
LES, = \(* y): max {d?(u®, v) — 6} > a (8.1.53)
l=1 l
Then, restricting the sum to .~ and using the union-of-events bound, we have
yr(P) < bd On(x)Pr(y |x) we {dy(u, v) — 0;
(x, y)Ee
< max df ¥ >) Qy(x)Px(y|x)
l (x, yhe of
Sy Pre yew = |) #i Pi
L
< yo >) Pr {(x, y) € a, |P}
1=1
L
= 79 ¥ Pr {d(u, v) > 6, |P} (8.1.54)
l=1
Using this to further bound (8.1.45) gives
metus? i
WB) < Yo Ss Pr {dO (u, vy) > 0, | P} + ye Uh (8.1.55)
l=1
where |
E(R, P) >0 for R > I(P) (8.1.56)
Suppose we now wish to encode in such a way as to achieve average distor-
tions {D: ] = 1, 2, ..., L} for each source-user pair. Let D = (D“, D®, ..., D™)
be the desired vector distortion. Consider the average of d'(u, v) over the joint
distribution O(x)P{y|x) = Ol”, a!, 2) ar PO OR ee, 2)
YY Q(x)P(y|x)d(U®, v®) = YY Q(u)PP(v|u)d(u, v) (8.1.57)
where Q(u) and P(v|u) are marginal distributions of Q(x) and P(y|x), and
define the class of conditional probability distributions
Pa Pov) S > O(x)P(y|x)d@(u®, 0) < D®;1= 1, 2,..., EL} (8.1.58)
We now define the vector rate distortion function as
R(D) = min I(P) (8.1.59)
Pe Ap
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 477
To show that R(D) indeed is the rate distortion function for encoding each
source-user component pair with distortion {D: | = 1, ..., L}, we must prove a
coding theorem and its converse.
Theorem 8.1.2: Source coding theorem—vector distortion Given ¢ > 0 and
desired distortions {D: ]= 1, 2,..., L} for the L source-user component
pairs, there exists an integer N, such that for each block length N > N, there
exists a code BZ = {y;, yo, .--, Yu} of rate R < R(D) + € for which
Y° Qy(x) min max i = du, vp?) — D®| cs (8.1.60)
yeZR I
That is, the /th source—user pair has average distortion less than or equal to
D® + ¢ for I = 1, 2, ..., L.
ProoF In equality (8.1.55), choose parameters 6, = D + ¢/2. Then for
each |
a8 1 d?(u, p) etal
7 (t-6)} > D +5 iy
"\N , 2
(8.1.61)
For Pe, and source distribution Q(-), the terms {d(u, v):
n=1,2,..., N} are independent, identically distributed random variables
with mean values less than or equal to D". Hence by the weak law of large
numbers
lw QM yO (I) pl_p
lim Pr 9 u®”,-y®) > D® +5
Noo
for any P € Ap. In particular, let P € Ap achieve R(D) = /(P). Then from
(8.1.55) and (8.1.62), for any R > R(D) there exists an integer N, such that, for
any block length N > N,
P|= = (8.1.62)
L
W(B) < Yo d, Pr avtu, v) > D® + 5 P| + ype NER P)
(8.1.63)
Hence there exists a code # of rate R < R(D) + « and block length N>N,
such that
»(Z) = ¥ Qy(x) min max au. vi) — | D® + 5 )
x yeRZR I
(8.1.64)
or
“4
>)
=,
ia}
E.
5
=
9
>
Zle
iMz
d(u, vo) — D® Ex (8.1.65)
1
478 SOURCE CODING FOR DIGITAL COMMUNICATION
Theorem 8.1.3: Converse source coding theorem—vector distortion If a
code B = {y;, y2,---, Ym} of block length N and rate R = (In M)/N achieves
average distortion {D: ] = 1, 2, ..., L} for each of the L source-user com-
ponent pairs, then R > R(D) where D = (D"?, D™, ..., D™).
ProoF For the code Z = {y,, y2, ..., Yu}, define the conditional probability
1 EB and xy) = B)
Py(y|x) = y Ywl% y) = r(x |) (8.1.66)
0 otherwise
Then since code Z achieves average distortion D” for each I, it follows that
>»: Onl Ay ix ee 9) 2 BO lee 3 oy Ley. 40.167)
> See
Let P(y|x) be the nth marginal distribution of Py(y|x) and define the
probability distribution
Polx)= x © PMls (8.1.68)
Then for each |
N
YD Qn(x)Pr(y |x)dy(u, v?) = Qy(x)Py(y |x) Fe 2. d®(u, v)
ee f i.
2
1 LE OPM | x14”, 0)
=¥E 6)(5 F Pls) }aru, 0)
= SE Ox PCr x),
< D® (8.1.69)
R(D) <1(P)
= i, y p) (8.1.70)
From inequalities (7.2.46) and (7.2.47), we have bounds
—
z
z
o
ee aN
lA
ee A ee
(8.1.71)
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 479
Theorems 8.1.2 and 8.1.3 establish R(D) as the rate distortion function for the
vector source with a maximum distortion measure. For the special case where the
L source components are independent, we have the following corollary.
Corollary 8.1.4: Independent components—maximum distortion When the
L source components are independent so that
Q(x) = [J Ow)
l=1
where x = (u"”, u, ..., u™), the rate distortion function is
L
RD) = > ROD") (8.1.72)
l=1
where R”(D) is the rate distortion function of the /th component of the
source output vector.
ProoF This follows directly from the proofs of Theorems 8.1.2 and 8.1.3 with
the further independence P(y|x) = []/., P®(v®|u®). Heuristically, since the
sources are independent in the multiple source-user description of Fig. 8.1,
the source encoder is forced to send, for each source—user pair, enough infor-
mation to achieve its distortion independent of information for other source—
user pairs.
8.2 SOURCES WITH MEMORY
Although many sources that arise in practice can be modeled as discrete-time
sources with real-valued output symbols, they generally have memory of some
sort. By taking advantage of the statistical dependence between source output
symbols, for a given fidelity criterion, sources with memory can be encoded using
fewer bits per source symbol than with corresponding memoryless sources. For a
given average distortion level D, the rate distortion function for a source with
statistical dependence between output symbols is less than for a corresponding
memoryless source. Theorem 8.1.1 shows this to be true in a special case. Indeed,
for memoryless sources, the data rate cannot be reduced without incurring large
distortions. For this reason, source coding techniques of rate distortion theory are
mainly worthwhile for sources with memory. In this section, we examine discrete-
time stationary sources with memory and define the rate distortion function for
discrete-time stationary ergodic sources.
Many sources, such as speech, are modeled as continuous-time sources. Con-
tinuous-time sources can be treated as discrete-time sources with source alphabets
that are time functions. By considering general alphabets, we can treat a large
class of sources, including picture sources such as television. Coding theorems for
discrete-time stationary ergodic sources with general abstract alphabets are given
in Berger [1971]. In Sec. 8.4 of this chapter, we examine a few Gaussian source
examples of these more abstract alphabets.
480 SOURCE CODING FOR DIGITAL COMMUNICATION
Let us consider now a discrete-time source with statistically dependent source
output symbols. For convenience, attention is restricted to a source with discrete
output alphabet W = {a;, a2, ..., ay}. Let u=(..., u_4, Up, Uy, ...) denote the
random sequence of output letters produced by the source.” The source is com-
pletely described by the probabilities
Q,(a,, Ae, +++, Ay; = Pr {ts 4; = Hy, Une, = AQ, ---, Ue = a}
for all times t and lengths L. In general, little can be said about source coding of
sources which are nonstationary. Hence, we assume throughout this section that
the source is stationary; that is
Q,(a;, “2, o 416-9 Ors t) a 0,(, X2, cee ay) (8.2.1)
is independent of time t for all letter sequences {«,, «7, ..., «,} and all lengths L.
In addition to assuming that the source is stationary, we temporarily require
that the source also be ergodic. (Later, in Sec. 8.6, we shall relax this ergodicity
assumption by examining an example of a nonergodic stationary source which we
can encode efficiently.) Ergodicity is essentially equivalent to the requirement that
the time averages over any sample source output sequence are equal to the en-
semble averages. Specifically, let u = (..., u_1, Ug, Uy, ...) be a sample output
sequence and let u’ denote the sequence u shifted in time by / positions. That is
a 4° I > for aT (8.2.2)
Also, let fy(u) be a function of u that depends only on u,, u,,..., uy. Then a
stationary source is ergodic if and only if for all N > 1 and all such functions f,(u
for which
E{| fy(u)|} < 20 (8.2.3)
we have
LE
lim ~ Y fy(w) = El fa(u)} (8.2.4)
160 L l=1
for all source sequences u (except at most a set of probability zero). Here E{ - } is
the usual ensemble average. The ergodicity assumption will ensure that if a source
code is “good” for encoding a particular sample sequence with fidelity D, then it
will also be good for all sample sequences of the stationary ergodic source. This
will become even more evident when we consider an example of a nonergodic
stationary source.
Now suppose we have a discrete-time stationary ergodic source, as described
above, with discrete source alphabet W = {a,,a,..., a4}, representation
alphabet ¥ = {b,, bj, ..., bg}, and a bounded single-letter distortion measure
{d(u, v)} where 0 < d(u, v) < do for all u, v. The source probabilities are given by
* Continuous-amplitude sources can be easily handled by replacing probability distributions with
probability density functions. Coding theorems follow with an appropriate bounded moment condi-
tion similar to that imposed in Sec. 7.5.
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 481
{Q, (uy, uz, ..., u,), L > 1}. Shannon [1959] and Gallager [1968] have shown that
the rate distortion function is given by
R(D) = lim R,(D) (8.2.5)
where :
R,(D)= min -1(P:) (8.2.6)
Fy = Pu lu): © > Q,(u)P_(v |u)d,(u, v) < D (8.2.7)
WP.) = 5 Qr(u)P (elu) in “Ol (8.28)
P,(v) = >, Q,(u)P_(¥|u) (8.2.9)
Paras ; > rere (8.2.10)
The coding theorem for this case is rather difficult to prove. A direct proof will be
given here only for the Gaussian source with the squared-error distortion meas-
ure. General proofs can be found in Gallager [1968] and Berger [1971].
We present instead a simple heuristic argument which requires an additional
assumption that appears reasonable for many real source models. Assume that
there exists a finite interval Ty) such that source outputs separated by Ty or more
units of time are statistically independent. That is, for two random source output
letters u, and u, at corresponding times t and t’, for which |t — t'| > Ty, we have
Olu,, uy) = O(u,)Q(u,) (8.2.11)
For many real sources, a model with such a finite interval of dependence seems
reasonable. From a mathematical point of view, this is a rather strong assumption
which simplifies our heuristic argument that (8.2.5) is the rate distortion function.
Consider grouping together consecutive source output symbols into groups
of length L + T,. Out of each group we only attempt to encode the first L
source output symbols and ignore the remaining Ty symbols (i.e., we neither
represent them nor send them, although the decoder knows that these last Tp
symbols are missing). Because of our assumption we then have a sequence of
independent identically distributed sets of source output sequences, each consist-
ing of L source output symbols. Defining
= = 7
Tote. we ae SS, (8.2.12)
ye his Ps; dvd) € GH 7%
and distortion
1 L
10%, YI : > d(u,, v,) < do (8.2.13)
482 SOURCE CODING FOR DIGITAL COMMUNICATION
we have a new extended discrete memoryless source with source probability
Q,(x) = Q,(u,, uz, ..., u,) for each letter x € 2 = W, and single-letter distortion
measure d,(x, y). Applying the results of Sec. 8.1 for a memoryless source with
vector outputs, the rate distortion function for the extended discrete memoryless
source is given by
R(D; L)= min I[(P,) _ nats/extended source symbol (8.2.14)
PL € PAp,L
where
Py. = \Pi(v|u): >) > Q,(u)P,(v |u)d,(u, v) < D (8.2.15)
Here the dimensions of R(D; L) are nats per symbol of the extended source. But
each extended source symbol corresponds to L + Ty actual source output sym-
bols. Hence, in terms of nats per actual source symbol, (8.2.14) becomes, using
(8.2.6),
R(D; L) 3
L+T, Ta f, R,(D) _ nats/source symbol (8.2.16)
Since the J) unrepresented source symbols can produce upon decoding the maxi-
mum distortion d,, this means that by using the above encoding strategy we can
achieve average distortion
T
ee do( =
ae Si 3
with a code of rate
#
R,(D
1+. 1e L(D)
Clearly, by letting L— oo, we can achieve average distortion D with a code rate
R(D) = lim R,(D)
La
Hence we see that if there exists a finite interval 7, such that source outputs
separated by TJ) or more units of time are statistically independent, then the
heuristic proof of the coding theorem follows directly from the coding theorem for
memoryless vector sources. For general stationary ergodic sources there are sim1-
lar (though more difficult to prove) coding theorems resulting in the definition of
the rate distortion function given in (8.2.5).
In the above discussion we assumed that R,(D) converged for stationary
ergodic sources. We can interpret R,(D) as the normalized rate distortion function
of a memoryless vector source of L components as described in Sec. 8.1. By
increasing L, more of the statistical dependence between source outputs can be
exploited, so that we expect the required rate per source symbol to decrease with
an increase in L. This is true for all stationary sources, as shown next.
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 483
Lemma 8.2.1
R(D) = lim R,(D) = inf R,(D) (8.2.17)
Lo |e |
ProoFr Consider integers / and m and let N = / + m. Let P, and P,,, be condi-
tional probabilities that achieve R,(D) = (1/l)I(P,) and R,,(D) = (1/m)I(P,,)
and define
P,(v|u) = Pv'|u')P,,(v"|u") where v=vv"u=u'u” (8.2.18)
Then
dd On(u)Pr(y |u)dy(u, y)
— 2 d Ox(u)Py(v Oe du’, v') + 7 anu ")
l
N
= DY ClwyPiv'|u) du, v!) + YY Qn(u")Pn(V™ |U") d(u", ¥")
uly! um ym
-p (8.2.19)
Hence P,(v|u) belongs to Pp y, and from Lemma 1.2.1 of Chap. 1
a
pa
=
A
=
o
=
Py(viu
P,(v)
u
Ox(u)Pa(y|u) in al (8.2.20)
where P,(v) is any probability distribution. We choose P,(v) = P,(v')P,,(v”)
where
—
Q,(u)Py(v|u) In
v
IA |
fel ae |
th)
oa ta
cM
<M
8.2.21
and P,,(v") = > Q,,(u)P,,(¥™ | u) ( )
Hence
Ry(D) < {UP )) + 1(Pp)}
Pai m | 1
=x [mea] + 0 1e)
l m
= 5 RD) + 5, Ry(D) (8.2.22)
484 SOURCE CODING FOR DIGITAL COMMUNICATION
Now let R(D) = inf,, , R,(D). Then for any ¢ > 0, choose N to satisfy
R,(D) < R(D) + € (8.2.23)
From (8.2.22), letting 1 = m = N, we have
Ryy(D) < 2Ry(D) + 2Ry(D)
sas R,(D)
<R(D) +e (8.2.24)
Similarly
R,y(D) < R(D) + «€ for all k > 1 (8.2.25)
For any integer L, we can find k and j such that L = kN + j where 0 <j <
N — 1. Then
R,(D) <“~ Ryy(D) + 2 RD)
< - [R(D) + «] + — R,(D)
Since ¢ > 0 is arbitrary, we have
lim R,(D) = R(D)
L~>@®
Having given a heuristic argument to motivate the coding theorem for at least
a subclass of stationary ergodic sources, we next prove the general converse
coding theorem.
Theorem 8.2.1: Converse source coding theorem—stationary ergodic sources
For any source encoder-—decoder pair, if the average distortion is less than
or equal to D, then the rate R must satisfy R > R(D).
Proor Any encoder-—decoder pair defines a mapping from source sequences
to user sequences. For any length N, consider the mapping from Wy to Vy,
where we let M be the number of distinct sequences in Wy into which se-
quences of Wy, are mapped. Define
ie j1 if v is the sequence into which u is mapped
ae 8.2.27
\0 otherwise ( )
Py(v|u)
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 485
Then if this mapping results in average distortion of D or less, we have average
distortion
D(Px) = > > Qy(u)Px(v |u)dy(u, v)
<D (8.2.28)
and hence Py € Ap y. Thus
In M (8.2.29)
Since
we have R(D) < (1/N) In M=R.
This converse theorem together with the heuristic proof of the coding theorem
completes our justification of R(D) given by (8.1.5) as the rate distortion function
for stationary ergodic sources.* This discussion easily extends to continuous am-
plitude stationary ergodic sources where, instead of a bounded distortion meas-
ure, we require a bounded moment condition on the distortion measure.
Another form of R(D) for stationary ergodic sources can be obtained using a
definition given in terms of random processes, rather than limits of minimizations
involving random vectors. This definition of a rate distortion function is analo-
gous to Khinchine’s process definition of channel capacity [1957]. Again consider
the stationary ergodic source described above. Next suppose there is a jointly
stationary ergodic random process {u,,, v,} consisting of pairs u, € Wandv,¢e V.
This implies that there is a consistent family of probability distribution functions
{Py(u, v): ue Wy, ve Vy} for all N which satisfies the condition
Q,(u) = > Py(u, v) for all N (8.2.30)
Given a stationary ergodic source, there always exists a jointly ergodic pair source
that satisfies this condition. Since the pair process is stationary, we can define the
average per letter mutual information
1
I — lim yn Lol ten: va (8.2.31)
No
Pp
* This converse theorem is true for nonergodic stationary sources if we interpret average distortion
as an ensemble average.
486 SOURCE CODING FOR DIGITAL COMMUNICATION
where the subscript p emphasizes the dependence of the particular pair processes.
In addition, we have average distortion
D, = E,tdy(u, ¥)}
1 N
et Ey 2 d(u,, o)
> >& P(u, v)d(u, v) (8.2.32)
For this particular jointly ergodic process, a sequence of block codes of rates
approaching I, can be found that can achieve average distortion arbitrarily close
te D,;
Pp
Theorem 8.2.2 Given any ¢ >0 and the jointly stationary ergodic process
defined above which satisfies the source condition (8.2.30), there exists a
sequence of block codes {#y} each of rate R < I, + € such that the average
distortions {d(By)} satisfy
lim d(#y)<D,+€
No
ProoF For any block code* By = {v1, V2,..., Vy}, we have average distortion
U(By) = 3 Qy(u)d(u| Ay)
=) > Py(u, v)d(u| By) (8.2.33)
where
d(u| By) = min dy(u, v) (8.2.34)
ve BN ;
Let
1 d(u|By) > dy(u, v)
@ 8 | 8.2.35
(u, v; Bx) \0 d(u| By) < dy(u, v) ( )
Then
d(By) = 2, 2: Py(u, v)d(u| Ay)[1 — ®(u, v; By)
+) > Py(u, v)d(u| By)O(u, v; By)
Noting that in the first term we have
d(u| By)[1 — O(u, v; By)] < dy(u, v) (8.2.36)
* Each codeword here corresponds to the choice of N random processes v;, V3, ..., Vy aS a
mapping for the N source processes uy, U2, ..., Uy.
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 487
and in the second term we can have
d(u| By) < do (8.2.37)
then
d(By) <D, +d. > ¥ Py(u, v)O(u, v; By) (8.2.38)
Defining?
P,(v) = >> Py(u, v) (8.2.39)
we bound the second term using the H6lder inequality,
» > Px(u, v)®(u, v; By)
O(u, v; By)
<¥ y P,(v) ° et » P,(v)®(u, v; 4) (8.2.40)
for any —1 <p <0. Averaging the bound with respect to an ensemble of
codes of block length N and rate R where codewords are chosen indepen-
dently with probability distribution P,(v), results in
(E Paoyolu, vi ay)“ < (EPA VOa vi ay)) ”
1 “EP
= & + i
< M?
OE aes (8.2.41)
where the first inequality follows from the Jensen inequality and the first
equality follows from the complete symmetry of {v, v,, V2, ..., Vy}. Thus,
d(By) <D, + doe”? 2 P,(v) Pa seit ee (8.2.42)
for —1 <p <0. Certainly there exists at least one code 4, in the ensemble
which also satisfies this bound on the ensemble average.
Next, letting p = —a/N for any « > 0, choose N >a so that
p = —a/Ne(-1,0)
> As in the proof of Lemma 7.2.1, v is a dummy vector.
488 SOURCE CODING FOR DIGITAL COMMUNICATION
Now consider the identity
1 p*
oe Po
l+p C 1+
N)/)
== jive: Rea GA!
N 1-—(a/N)
=1+— +0(N) where —o(N) = (a/N)/[1 — (a/N)]_ (8.2.43)
and consider the inequality
| : PAu vy) mpd tal
Ef Pan Py
ae u | y Py(u, v) chad erg
FONE Pls een] —_|
a . 3 Qn(u)Pxv) 15 se ior) 1/(1 omy
(a/N + o(N)) A a/N
= (ry pxtuny fae
v(u v(v)
nah I} (8.2.44)
= | Pal, v) exp tee Qy(u)Px(v) ||
Substituting this into (8.2.42) yields for some code By
jon i
: (8.2.45)
for any « > 0 and N large enough to guarantee N > «. For jointly ergodic
sources (McMillan [1953]) we have
d(By) <D,+ do onan >” Py(u, v) exp
\“
a Py(u, v) ss
tin ny in 0,(u)Py(v) T. (8.2.46)
where the convergence is with probability one. Hence
lim d(fy) < D, + dye *8-'” (8.2.47)
N>o
Finally, choose R = I, + ¢/2 and a = (2/e) In (dg /e), where € < dg, so that
lim d(#y) <D, + €
No
lor Rr d+
According to Theorem 8.2.2, R(D) is the smallest possible rate for average
distortion D. Thus
inf I, > R(D) (8.2.48)
Dp<D
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 489
where the minimization is taken over all stationary ergodic joint processes that
satisfy D, < D and (8.2.30). It has been shown (Gray et al. [1975]) thai the
minimization can be taken with respect to all jointly stationary sources since all
jointly stationary sources that satisfy the minimization can be approximated
arbitrarily closely by a stationary ergodic source. In addition Gray et al. [1975]
have proven a converse coding theorem analogous to Theorem 8.2.1 which
establishes that
R(D) = inf IJ, (8.2.49)
Dp<D
For stationary ergodic sources, there are two equivalent definitions of the rate
distortion function, R(D), given by (8.2.5) and (8.2.49). In (8.2.5), R(D) is given as
a limit of minimizations involving random vectors, whereas in (8.2.49) it is given in
terms of minimizations involving random processes. In either case, R(D) is gen-
erally difficult to evaluate for most stationary ergodic sources. This is one of the
main weaknesses of the theory.
The most direct way to compute R(D) is to first find the form of R,(D) and
then take the limit as L > o0. R,(D), given by (8.2.6), can be interpreted as the rate
distortion function of an L-dimensional memoryless vector source, where the
vector components are not necessarily independent and the distortion between
vector outputs and representation vectors is the sum distortion measure. Thus
R,(D) is exactly the rate distortion function of a vector source with the sum
distortion measure discussed in Sec. 8.1.1. There we found a simple expression for
this rate distortion function when the component sources were independent. If by
an appropriate transformation we can reduce the calculation of R,(D) to that of
the independent component sources, then we can obtain an equally simple expres-
sion for R,(D) and often obtain R(D). We do this in the following example.
Example (Gaussian source, squared-error distortion) Consider a discrete-time zero-mean sta-
tionary Gaussian source with output sequence (..., u_,, U9, 4,,...) and correlation between
outputs u; and u,; denoted by
$i = O- = Eluju} —_ for alli, j (8.2.50)
Stationarity implies that this correlation depends only on |i — j| and since the source is Gaussian
it also implies that it is ergodic. We wish to calculate R(D) for this source with the squared-error
distortion measure
1
d,(u, v) = L p39 (u, — v,)
1
ers aed (8.2.51)
for any u, ve &, where we have “= ¥ = &. We begin by calculating R,(D).
The sequence u € #, has the joint density function
_ p-(1/2)u@- tur (8.2.52)
Q,(u) = (Qn)E2 ||! 2
490 SOURCE CODING FOR DIGITAL COMMUNICATION
where
abe Sees hk
d, Po owe ces
® = cae een ues mores (8.2.53)
Ea craw eke
de welaniliie 2 cok 2
is the covariance matrix. Here, assume that ® is positive definite so that ®~ ' exists for any finite
L. Let T denote the unitary modal matrix whose columns are the orthonormal eigenvectors
of ® with eigenvalues j,, 42, ..., A,. Since ® is positive definite, the eigenvalues are positive.
Letting
i, os
A,
a -. (8.2.54)
0 z
we have
ea TAL nat ek ee (8.2.55)
Define transformed source and representation vectors by & = ul and ¥ = vI, where now
ii has covariance matrix A and probability density
Q,(a) = I Pt op (8.2.56)
Note that the components of i are independent random variables. In addition, since [7 = I we
have
dy(@ ) => fa — 4?
= (— sya - 4)
= aC — v)][I'7(u — v)"]
= 7 (u-vu-vy
ye) (8.2.57)
For any conditional probability density P,(v|u), let P,(¥|&) be the corresponding density for ¥
conditioned on a. Since I is an invertible mapping, it preserves average mutual information
WP)= | | Ou(aPy(ela) in POM
=| {7 Oy, 6a) in 21
-x _
= 1(P,) (8.2.58)
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 49]
Also, d,(u, v) = d,(a, ¥) implies
DP,)={ | Qulw)Px(v|u)d,(u, ») dv du
=| { O,@)P,@|a)4,(@ ¥) a da
= D(P,) (8.2.59)
Thus R,(D) can be expressed in terms of the transformed space
R,(D)= __ inf J 1(P,) (8.2.60)
P, €FAp 1 L
where
Py = {P,(¥|a): D(P,) < D} (8.2.61)
Since u € #, has independent components, we can regard R,(D) as the rate distortion function of
a vector source with independent components where the /th component source output has
density function
Q(i,) = 7 87/24, (8.2.62)
/ 2nd,
In Lemma 8.1.1, R,(D) in nats per sample is given by the parametric equations
1 L
D,=— > D? (8.2.63)
L i=;
and
1 L
R,(D,) => Y RD!) (8.2.64)
l=1
where R“(D°”) is the rate distortion function of the Ith component source. The example in Sec. 8.1
gives for this case the parametric equations for R,(D) and for parameter 6 = —1/2s > 0
1 L
D,=— > min (9, d,) (8.2.65)
L i;
and
{2 Ay
R,(D,) = — > max (0.3 In — (8.2.66)
a i=1 6
We are now ready to pass to the limit L— o0. To do this we need to use the well-known
limit theorem for Toeplitz matrices.
Theorem 8.2.3: Toeplitz distribution theorem Let ®, be the infinite covari-
ance matrix. The eigenvalues of ®, are contained in the interval 6 <A <A,
where 6 and A denote the essential infinum and supremum, respectively, of the
function
%(w)=.-) oe *” (8.2.67)
492 SOURCE CODING FOR DIGITAL COMMUNICATION
Moreover, if both 6 and A are finite and G(A) is any continuous function of A,
then
a GA) os Ps | * G[O(c)] do (8.2.68)
L>o l=1 ~
where A‘) is the Ith eigenvalue of ®, the L x L covariance matrix.
ProoFr See Grenander and Szego [1958, sec. 5.2].
Applying this theorem to (8.2.65) and (8.2.66), we have the parametric equation for R(D)
given by
D,= = | Iain © Olpyeae (8.2.69)
7
and
1 cg
R(D,) = yn | max
0, In el da (8.2.70)
For this Gaussian source with the squared-error distortion measure, we now prove a coding
theorem, in essentially the same way as was done earlier for memoryless sources, by encoding
the transformed source output sequence u=ul €&, with M _ codewords denoted
B=¥,, V2, ..., Vy}. For any conditional density function
L
P,(8|a) = [] PC,|%) (8.2.71)
l=1
we follow the coding theorem of Sec. 7.2 to get a bound on the average distortion (over code and
source ensemble)
d(Z) < : y D® + do em L/2ELR. 0 Pu) (8.2.72)
where
D® = . ie (8 — a)? P(3|a)Q(a) dd di (8.2.73)
dy=./3 max, (8.2.74)
IsIsL
E,(R, p, P,) = —pR— > In i af ({ ‘orf “ Pugoutilay*” di) da (8.2.75)
L
= [] PP) (8.2.76)
l=1
and
i) (8.2.77)
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 493
For a given parameter 0, choose (Tan [1975])
d(v,) A, < 6
PO, |4)= 1 e- @r- Ban?/269 4 Sg (8.2.78)
./ 2nB,0
where
0
B,=1-— (8.2.79)
A,
This choice yields
D® = min (8, 4,) (8.2.80)
and
l< I, P, [1 + p)])
E,(R, p, P;) = —pR—- = max (0,—In 8.2.81
u ) p L& 5 beset ( )
Thus
a ae L ee |, P (1+ p)|\
d(Z)<— min (6, A,) + dy ex -5(- R-—— max 0 fin [PO * a) } 8.2.82
(A) L& (8, 4,) 0 P| 5) p L = > 34 61) ( )
The Toeplitz distribution theorem gives
1
lim — y min (6, 4,) = D (8.2.83)
L>a l=1
and
see Sg ke + a
lim — ) max (0, — In = E_(p, 0 8.2.84
L> a LX | 2 A, + p@ ] te, 8) ( )
where
| OL +p)
E,(p, 0) = — — 0, p | wf
xo(P, 9) in) | p In Sah x of 4 0 | dw (8.2.85)
Here E ,(p, 8) has all the usual properties given in Lemma 7.2.2 where
0E.,(p, 8
tim CE: 9) _ Rw, (8.2.86)
p-0 op
Defining
E(R, D,)= max [—pR+ E,(p, 4)] (8.2.87)
- i<p<0
we have that for each «, > 0 and €, > 0 there exists an integer No(0, R, €,, €,) such that for each
L > Ng there exists a block code # of rate R and block length L such that
d(B) < Dy + €, + /3 AeW L/2HER, Do) e2] (8.2.88)
where E(R, D,) > 0 for R > R(D,).
This bound gives the rate of convergence to the rate distortion limit
(D,, R(D,)) and can be generalized to continuous-time Gaussian sources and
494 SOURCE CODING FOR DIGITAL COMMUNICATION
Gaussian image sources together with the squared-error distortion measures. Ex-
plicit evaluation of the rate distortion function
R(D) = lim R,(D)
L~>o
as was done in this example is generally possible only if R,(D) can be expressed as
the rate distortion function of a vector source with independent components and a
sum distortion measure. Otherwise we must settle for bounds on R(D).
8.3 BOUNDS FOR R(D)
For sources with memory, R(D) is known exactly only for a few cases, primarily
those involving Gaussian sources with a squared-error distortion measure. Easy-
to-calculate bounds to R(D) are therefore very important for general stationary
ergodic sources. Lower bounds particularly are useful since they represent limits
below which one cannot encode within the desired fidelity.
Recall that for stationary ergodic sources
< R,(D) (8.3.1)
Hence a trivial upper bound is R,(D), which may be found analytically or by using
the computational methods of App. 7A. For a squared-error distortion measure,
there is a more general version of Theorem 7.7.3 which shows that the Gaussian
source has the largest rate distortion function.
Theorem 8.3.1 For any zero-mean stationary ergodic source with spectral
density
O(w) = ee , en ik? (8.3.2)
where ¢, = E{u,u,,,} and the squared-error distortion measure, the rate dis-
tortion function R(D) is bounded by
Po D(w)
R(D) < | max Oar | do (8.3.3)
where 6 satisfies
1 Tt
= d 8.3.4
D a8 | min [0, ®(~@)] dw (8.3.4)
seat 3
That is, for a given spectral density, the Gaussian source yields the largest rate
distortion function.
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 495
Proor Recall from (8.2.55) that ® = TAI’, and A = diag (A,, ..., A,) is the
diagonal matrix of eigenvalues of ®. The components of i = ul are uncor-
related random variables with covariance matrix A. The rate distortion func-
tion R,(D) can be expressed in terms of these transformed coordinates. Now
recall from Theorem 8.1.1 that
R,(D) < R,(D) (8.3.5)
where R,(D) is the rate distortion function obtained if the coordinates of
ad = ul are independent. R,(D), on the other hand, is given by the parametric
equations (see Lemma 8.1.1)
1 E
D,=— > D® (8.3.6)
L i= 1
and
m :
R,(D,) = = Y R°(D®) (8.3.7)
i=1
where R"(D°) is the rate distortion function of the /th component source with
the squared-error distortion measure. From Theorem 7.7.3, there is the fur-
ther bound
R®(D) <4 max (0 In a (8.3.8)
Thus
R,(D) < R4(D) (8.3.9)
where R#(D) is the rate distortion function for the Gaussian source with the
same spectral density. Taking the limit as L— oo, we get the desired result.
Thus we have shown that one general upper bound on R(D) is simply R,(D),
the first-order rate distortion function, while for the squared-error distortion, a
bound can be obtained which is the known rate distortion function of a Gaussian
source with the same covariance properties. Lower bounds can also be found by
generalizing the lower bounds for memoryless sources.
Suppose we have a continuous-amplitude stationary ergodic source and some
distortion measure. Let R,(D) be its Lth-order rate distortion function. The L-
dimensional version of Theorem 7.6.3 is
R,(D) => “Sap sp +7 fi. ve {a Q,(u) In A,(u) du} = (8.3.10)
ss 0, AL € As, z
where
fe)
A,,.= ctu) . — A,(u)Q, (ule du< 1, ve re (8.3.11)
— @
496 SOURCE CODING FOR DIGITAL COMMUNICATION
Now choose
a,(u) = + — [] A(u)otu) (8.3.12)
b s<O,AeEAs =O)
~ R,(D) + h(t) — n(av) (8.3.13)
where
h(4,) = —| i i Ok So io (8.3.14)
This results in the following theorem.
Theorem 8.3.2 For a stationary ergodic source with rate distortion function
R(D) = lim R,(D)
L>o
there is the lower bound
R(D) > R,(D) +h — h(@) (8.3.15)
where
n(w)=—| Qu) In Qu) du (8.3.16)
is the first-order differential entropy rate of the source and
1 oe) 00 :
h= — lim - | a Q,(u) In Q,(u) du (8.3.17)
L> © — EK,
is the differential entropy rate of the source.
Proor Take the limit as L— oo in (8.3.13). The limiting value h exists and is
approached monotonically from above (see Fano [1961]).
In this general lower bound, we express the bound in terms of R,(D), which
can usually be found by computational methods. Of course, further lower bounds
exist for R,(D), as described in Sec. 7.7. For difference distortion measures there
is also a generalized Shannon lower bound (see Prob. 8.7) given by®
le @)
| et) dz
— @
where D, is the distortion level associated with parameter s.
R(D,) > R,,(D,) = h + sD, — In (8.3.18)
© We assume
fe @)
| e4) dz < 0
aed
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 497
Corollary 8.3.3 For a stationary Gaussian source with spectral density func-
tion ®(w) and the squared-error distortion measure, the rate distortion func-
tion, R(D), is bounded from below by
R(D) > Ri9(D) =} In = (8.3.19)
where
E=exp 5s - In ®(w) do} (8.3.20)
is both the entropy rate power and the one-step prediction error of the Gaus-
sian source (see Grenander and Szego [1958, chap. 10]). Moreover, R(D) =
R,,(D) for D < 6, where 6 is the essential infimum of ®(a).
ProoFr For the Gaussian source where
R,(D) =41n = (8.3.21)
we have
h(%) = 4 In (22e07) (8.3.22)
and h = 3 In (2neE) (8.3.23)
(see Prob. 8.8). Thus
R,3(D) = R,(D) + h — h(%)
2
=} In - + 4 In (2meE) — 5 In (27e07)
E
=$ln — 8.3.24
sin = (8.3.24)
From (8.2.69) and (8.2.70), we see that if $\theta$ does not exceed the essential infimum of $\Phi(\omega)$, then
$$D_\theta = \frac{1}{2\pi}\int_{-\pi}^{\pi}\min[\theta,\Phi(\omega)]\,d\omega = \theta \tag{8.3.25}$$
and
$$R(D_\theta) = \frac{1}{4\pi}\int_{-\pi}^{\pi}\max\left[0,\ \ln\frac{\Phi(\omega)}{\theta}\right]d\omega = \frac{1}{2}\left[\frac{1}{2\pi}\int_{-\pi}^{\pi}\ln\Phi(\omega)\,d\omega - \ln\theta\right] = \frac{1}{2}\ln\frac{E}{D_\theta} \tag{8.3.26}$$
As just shown in the Gaussian case, the generalized Shannon lower bound
given by (8.3.15) is often equal to R(D) for a range of small D. The examples of
Sec. 7.7 imply that in most cases the Shannon lower bound is fairly tight for all D.
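As a rough numerical illustration of Corollary 8.3.3, the following Python sketch evaluates the entropy rate power $E$ of (8.3.20) for an assumed first-order Gauss-Markov (AR(1)) source with $\Phi(\omega) = \sigma_w^2/(1 - 2a\cos\omega + a^2)$ and compares the bound $\frac{1}{2}\ln(E/D)$ with the water-filling evaluation of (8.2.69) and (8.2.70); the spectral density and the parameter values are assumptions chosen only for this example.

```python
import numpy as np

# Assumed example: AR(1) source u_t = a*u_{t-1} + w_t with innovation variance sw2,
# so Phi(w) = sw2 / (1 - 2a*cos(w) + a^2).
a, sw2 = 0.9, 1.0
w = np.linspace(-np.pi, np.pi, 200001)
Phi = sw2 / (1.0 - 2.0 * a * np.cos(w) + a * a)

# Entropy rate power (8.3.20); for this source it should equal the
# one-step prediction error sw2.
E = np.exp(np.trapz(np.log(Phi), w) / (2.0 * np.pi))
print("entropy rate power E =", E, " (one-step prediction error =", sw2, ")")

# Exact R(D) by water-filling, (8.2.69)-(8.2.70), versus the bound (8.3.19).
def water_fill(theta):
    D = np.trapz(np.minimum(theta, Phi), w) / (2.0 * np.pi)
    R = np.trapz(np.maximum(0.0, np.log(Phi / theta)), w) / (4.0 * np.pi)
    return D, R

for theta in (0.5 * Phi.min(), Phi.min(), 5.0 * Phi.min()):
    D, R = water_fill(theta)
    R_lb = 0.5 * np.log(E / D)    # generalized Shannon lower bound (8.3.19)
    print(f"D = {D:.4f}   R(D) = {R:.4f}   (1/2) ln(E/D) = {R_lb:.4f}")
```

For water levels $\theta$ at or below the minimum of $\Phi(\omega)$ the two values printed on a line coincide, as the corollary asserts; above it, the bound falls below the true $R(D)$.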
8.4 GAUSSIAN SOURCES WITH SQUARED-ERROR
DISTORTION
Up to this point we have always assumed that the source probability distributions
and the distortion measure are given. In practice, statistical properties of real
sources are not known a priori and must be determined by measurement.
Typically only the mean and correlation properties of a source are available.
These first- and second-order statistics of a source are sufficient to completely
characterize a source if it is Gaussian, an assumption which is often made in
practice. In many cases one can justify the Gaussian assumption with a central
limit theorem argument. The choice of distortion measure depends on the applica-
tion, and again it is usually not known a priori. In speech and picture compression
applications, for example, there have been evaluations of various distortion meas-
ures based on subjective fidelity ratings of compressed speech and pictures. In
practice, the most commonly used distortion measure is the squared-error
distortion.
For the most part in data compression practice, the sources are assumed to be
Gaussian and the distortion measure is assumed to be squared error. Theorem
8.3.1 shows that, for the squared-error distortion measure, the Gaussian assump-
tion results in the maximum rate distortion function. Thus for a given fidelity D,
the value of the rate distortion function of the Gaussian source R(D) is an achiev-
able rate regardless of whether or not the source is Gaussian. Another important
point is the fact that the Gaussian source with the squared-error distortion meas-
ure is the only example where the rate distortion function is easily obtained for all
sorts of generalizations. These serve as a baseline with which various compression
techniques can be compared. We look first at quantization of a memoryless Gaus-
sian source and compare the resulting averaged squared error with the corre-
sponding distortion that is achievable according to the rate distortion function.
Then we examine more general Gaussian sources with memory and find expres-
sions for their rate distortion functions.
8.4.1 Quantization of Discrete-Time Memoryless Sources
We begin with the simplest of sources, the discrete-time memoryless Gaussian
source with the squared-error distortion measure, where the rate distortion func-
tion is given by (7.7.20)
$$R(D) = \frac{1}{2}\ln\frac{\sigma^2}{D} \qquad 0\le D\le\sigma^2 \tag{7.7.20}$$
Here the source outputs are independent Gaussian random variables with zero
mean and variance $\sigma^2$. For this example, $R(D)$ represents the minimum rate
required to achieve average squared-error distortion D and, as shown by Theorem
7.7.3, even for non-Gaussian sources, R(D) given above represents an achievable
rate for the squared-error distortion measure.
The simplest and most common data compression technique is quantization
of the real-valued outputs of the source. An m-level quantizer, for example, con-
verts each source output $u\in\mathcal{U}$ into one of $m$ values $q_1, q_2, \ldots, q_m$. This can best
be described in terms of thresholds $T_1, T_2, \ldots, T_{m-1}$, where the source output
$u\in\mathcal{U}$ is converted to $q(u)\in\{q_1, q_2, \ldots, q_m\}$ according to
$$q(u) = \begin{cases} q_1 & u \le T_1\\ q_k & T_{k-1} < u \le T_k, \quad k = 2, 3, \ldots, m-1\\ q_m & u > T_{m-1}\end{cases} \tag{8.4.1}$$
The $m$-level quantizer converts each source output independently of other outputs
and yields an average distortion
$$D_m = \sum_{k=1}^{m}\int_{T_{k-1}}^{T_k}(u - q_k)^2\frac{1}{\sqrt{2\pi\sigma^2}}e^{-u^2/2\sigma^2}\,du \tag{8.4.2}$$
where we take $T_0 = -\infty$ and $T_m = \infty$. Since there are $m$ quantized values, this
requires at most $R_m = \ln m$ nats per output of the source to send the exact quantized values over the channel. In Fig. 8.2, we plot the theoretical limit $(D, R(D))$
together with $(D_m, R_m)$ for various values of $m$. Here we take the values $\{q_1, q_2, \ldots, q_m\}$ and thresholds $\{T_1, \ldots, T_{m-1}\}$ that minimize $D_m$, as determined by Lloyd
[1959] and Max [1960].
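The Lloyd-Max conditions alternate between two steps: each threshold is placed at the midpoint of its two neighboring levels, and each level is the centroid (conditional mean) of its cell. As a rough illustration, the Python sketch below iterates these conditions for a Gaussian source on a fine grid and reports the resulting pairs $(D_m, R_m = \ln m)$; the grid size and iteration count are arbitrary choices for the example.

```python
import numpy as np

def lloyd_max(m, sigma=1.0, iters=100):
    """Iterate the Lloyd-Max conditions for an m-level quantizer of a
    zero-mean Gaussian source, using a fine grid for the integrals."""
    u = np.linspace(-8 * sigma, 8 * sigma, 20001)
    p = np.exp(-u**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    q = np.linspace(-2 * sigma, 2 * sigma, m)          # initial levels
    for _ in range(iters):
        T = 0.5 * (q[:-1] + q[1:])                     # thresholds = midpoints of levels
        idx = np.searchsorted(T, u)                    # cell index of each grid point
        for k in range(m):                             # levels = centroids of their cells
            mask = idx == k
            q[k] = np.trapz(u[mask] * p[mask], u[mask]) / np.trapz(p[mask], u[mask])
    T = 0.5 * (q[:-1] + q[1:])
    idx = np.searchsorted(T, u)
    D = np.trapz((u - q[idx])**2 * p, u)               # average distortion, cf. (8.4.2)
    return D, np.log(m)

sigma = 1.0
for m in (2, 4, 8, 16):
    D, R = lloyd_max(m, sigma)
    print(f"m = {m:2d}   R_m = ln m = {R:.3f} nats   D_m = {D:.4f}   "
          f"distortion at the R(D) limit: {sigma**2 * np.exp(-2 * R):.4f}")
```

The last column, $\sigma^2 e^{-2R}$, is the distortion predicted by the rate distortion function at the same rate, so the gap between the two distortion columns is the penalty paid for quantizing one output at a time without coding.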
[Figure 8.2 Quantization techniques. Rate (nats per source output) versus normalized distortion $D/\sigma^2$ (from $10^{-4}$ to 1), comparing the rate distortion limit $R(D)$, Lloyd-Max quantizers (uncoded), and uniform quantizers, optimized and coded (Goblick and Holsinger).]
The quantization technique can be improved by observing that quantization
level $q_k$ has probability
$$P_k = \int_{T_{k-1}}^{T_k}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-u^2/2\sigma^2}\,du \qquad k = 1, 2, \ldots, m \tag{8.4.3}$$
and the entropy of the quantized values is
$$H_m = -\sum_{k=1}^{m}P_k\ln P_k \le \ln m \tag{8.4.4}$$
We can encode without distortion (see Chap. 1) the quantized source outputs with
rate arbitrarily close to $H_m$. Goblick and Holsinger [1967] investigated the minimization of $H_m$ by varying $m$, $\{q_k\}$, and $\{T_j\}$, subject to the requirement that
$D_m \le D$ for uniform $T_j - T_{j-1}$. Their results consist of a family of uniform
quantizers whose performance envelope is shown in Fig. 8.2. Quantization with
distortionless coding of the quantized source outputs results in a required rate
which is only about 0.2 nats per source output more than the theoretical limit
given by the rate distortion function $R(D) = \frac{1}{2}\ln(\sigma^2/D)$. This is not too surprising
since the source is memoryless. Also, the distortionless source coding of the
quantized values requires both memory and the use of codewords similar to the
procedure for encoding the source directly with a fidelity criterion. If distor-
tionless coding is not used, then the performance gets worse as rate increases, as
shown by the Lloyd—Max quantizers in Fig. 8.2.
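To make (8.4.3) and (8.4.4) concrete, the short Python sketch below computes the output probabilities $P_k$ of a quantizer applied to the unit-variance Gaussian source from the Gaussian distribution function and compares the entropy $H_m$ with $\ln m$. The particular uniform thresholds used are an assumption for the example (roughly the step size known to be good for eight levels), not the optimized Goblick-Holsinger designs.

```python
import math

def quantizer_entropy(T, sigma=1.0):
    """P_k and H_m (nats) for a quantizer with thresholds T applied to a
    zero-mean Gaussian source of variance sigma^2, per (8.4.3)-(8.4.4)."""
    cdf = lambda x: 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))
    edges = [-math.inf] + list(T) + [math.inf]
    P = [cdf(b) - cdf(a) for a, b in zip(edges[:-1], edges[1:])]
    H = -sum(p * math.log(p) for p in P if p > 0.0)
    return P, H

# Assumed example: a uniform 8-level quantizer with step 0.586*sigma.
m, step = 8, 0.586
T = [step * (k - (m - 1) / 2 + 0.5) for k in range(m - 1)]
P, H = quantizer_entropy(T)
print("P_k =", [round(p, 4) for p in P])
print(f"H_m = {H:.3f} nats   versus   ln m = {math.log(m):.3f} nats")
```

The printed $H_m$ is noticeably below $\ln m$, which is exactly the rate saving that distortionless (entropy) coding of the quantizer outputs can capture.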
Although our example is based on the Gaussian memoryless source with
squared-error distortion measure, for a large class of memoryless sources and
distortion measures, the simple quantization technique should result in perform-
ance close to the theoretical rate distortion limit. Quantization followed by
distortionless coding of the quantized values will further improve the perform-
ance. This example points out the fact that, in practice for most memoryless
sources, quantization is an efficient technique. For real-valued sources with
memory and for more general sources, quantization by itself is no longer
adequate.
8.4.2 Discrete-Time Stationary Sources
Consider next a discrete-time stationary (ergodic) Gaussian source with output
autocorrelation
$$\phi_k = E\{u_t u_{t+k}\} \qquad \text{all } t, k \tag{8.4.5}$$
For the squared-error distortion measure, the rate distortion function is given in
terms of parameter $\theta$ by (8.2.69) and (8.2.70)
$$D_\theta = \frac{1}{2\pi}\int_{-\pi}^{\pi}\min[\theta,\Phi(\omega)]\,d\omega \qquad\text{and}\qquad R(D_\theta) = \frac{1}{4\pi}\int_{-\pi}^{\pi}\max\left[0,\ \ln\frac{\Phi(\omega)}{\theta}\right]d\omega$$
where
$$\Phi(\omega) = \sum_{k=-\infty}^{\infty}\phi_k e^{-ik\omega} \tag{8.4.6}$$
Recall from the example in Sec. 8.2 that the above rate distortion function was
derived by considering the encoding of transformed source outputs. In particular,
for large integer $N$, let $\Gamma$ be the unitary modal matrix of the correlation matrix
$$\mathbf{\Phi} = \{\phi_{i-j}\}_{i,j=1}^{N}$$
and transform each source output sequence of length $N$, $\mathbf{u}\in\mathcal{U}_N$, into $\tilde{\mathbf{u}} = \mathbf{u}\Gamma$.
The components of $\tilde{\mathbf{u}}$ are uncorrelated (also independent for this Gaussian
source). The $N$th-order rate distortion function can be determined by regarding
each component of $\tilde{\mathbf{u}}$ as an independent output of a memoryless source, where an
output occurs each time $N$ actual source outputs occur. There is no loss in
encoding the transformed variables, since the transformation preserves the
squared-error distortion measure. That is, for $\tilde{\mathbf{u}} = \mathbf{u}\Gamma$ and $\tilde{\mathbf{v}} = \mathbf{v}\Gamma$, we have
$$d_N(\tilde{\mathbf{u}}, \tilde{\mathbf{v}}) = \frac{1}{N}\sum_{n=1}^{N}(\tilde{u}_n - \tilde{v}_n)^2 = \frac{1}{N}\|\mathbf{u} - \mathbf{v}\|^2 = d_N(\mathbf{u}, \mathbf{v}) \tag{8.4.7}$$
We have already shown in Sec. 8.4.1 that, for a memoryless Gaussian source,
quantization of the source outputs is an efficient way to encode. For a Gaussian
source with correlated outputs, this suggests that we should first transform the
source output sequence into an uncorrelated sequence and then quantize. This is
in fact the most common data compression procedure. We may argue intuitively
that since we have an efficient and simple data compression technique for mem-
oryless sources, we should first “whiten” the source output sequence and by so
transforming it, obtain a memoryless (uncorrelated) sequence which can thus be
efficiently encoded by quantization. The transformation should be chosen so as to
preserve the distortion measure. For example, let T be an invertible transforma-
tion so that the output sequence u is transformed into the uncorrelated sequence
i = uT: Let q be the quantized sequence of and assume this is sent over the
noiseless channel. The decoder uses q = qT ' as the representation of the source
sequence u.
$$\mathbf{u} \xrightarrow{\ T\ } \tilde{\mathbf{u}} \xrightarrow{\text{quantization}} \tilde{\mathbf{q}} \xrightarrow{\ T^{-1}\ } \mathbf{q} \tag{8.4.8}$$
For the squared-error distortion measure, the unitary modal matrix of the covari-
ance matrix satisfies this requirement. Here, quantization may be slightly more
general in that different quantizers may be applied to different components of the
uncorrelated sequence $\tilde{\mathbf{u}}$.
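The following Python sketch carries out the steps of (8.4.8) under assumed conditions: the source correlation is taken to be $\phi_k = \rho^{|k|}$ and the quantizer is a plain uniform one, both chosen only for illustration. It forms the unitary modal matrix $\Gamma$ of the covariance matrix, transforms a source block, quantizes the uncorrelated components, inverts the transform, and checks numerically that the transformation preserves the squared-error distortion, as in (8.4.7).

```python
import numpy as np

rng = np.random.default_rng(0)
N, rho = 32, 0.95

# Covariance matrix of a stationary Gaussian sequence with phi_k = rho^|k| (assumed example).
k = np.arange(N)
Phi = rho ** np.abs(k[:, None] - k[None, :])

# Unitary modal matrix Gamma: columns are orthonormal eigenvectors of Phi.
lam, Gamma = np.linalg.eigh(Phi)

# One source block u and its transform u_tilde = u Gamma (uncorrelated components).
u = rng.multivariate_normal(np.zeros(N), Phi)
u_tilde = u @ Gamma

# Quantize each transformed component with a simple uniform quantizer (step is arbitrary).
step = 0.25
q_tilde = step * np.round(u_tilde / step)

# Decoder: invert the transform.
q = q_tilde @ Gamma.T

# The unitary transform preserves squared-error distortion, cf. (8.4.7).
d_transform = np.mean((u_tilde - q_tilde) ** 2)
d_source = np.mean((u - q) ** 2)
print(f"distortion in transform domain {d_transform:.6f} = in source domain {d_source:.6f}")
```

In practice different step sizes (or different quantizers altogether) would be assigned to different components according to their variances $\lambda_k$, which is where the water-filling allocation of the following sections enters.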
8.4.3 Continuous-Time Sources and Generalizations
Up to this point we have examined only discrete-time sources with source
alphabets which are sets of real numbers. Many common information sources
with outputs such as voice waveforms and pictures can be modeled as discrete-
time real-valued sources only if the source has been sampled in an appropriate
manner. In this section we take the approach of modeling all such more general
sources as discrete-time sources with abstract alphabets. For continuous-time
sources such as voice, for example, we consider sources that emit a continuous-
time waveform each unit of time. Thus each unit of time the discrete-time model
for a voice source emits an element belonging to the more abstract alphabet of
continuous-time functions. Picture sources or television can similarly be modeled
as a discrete-time source with the source alphabet consisting of pictures. Hence, by
allowing the source alphabets to lie in more general spaces, we can model more
general classes of sources.
The corresponding source coding problem for general sources modeled in this
manner can be formulated conceptually in the same way as for those with real
source alphabets. Defining appropriate probability measures on the abstract source
and representation alphabets and defining a distortion measure between elements
in these alphabets, Berger [1971] has formulated the problem in this more abstract
setting. The resulting rate distortion functions are defined in terms of mutual
information between source and representation alphabets in the same manner as
those given earlier for stationary ergodic sources with real alphabets. The main
difference lies in the more general probability measures required for the abstract
alphabets.
We do not attempt to prove coding theorems for discrete-time stationary
ergodic sources with abstract alphabets. Indeed, we will not even define the corre-
sponding rate distortion function. Besides requiring some measure-theoretic
definitions, generally these rate distortion functions are difficult to evaluate and
are known exactly only for some special cases. In this section, we present only a
few of the known cases for which the rate distortion function can be evaluated by
reducing the source outputs to a countable collection of independent random
variables, and where the distortion measure can be defined in terms of these
representative random variables.
Before proceeding with various examples we point out that, although we can
derive rate distortion functions for sources with abstract alphabets, to achieve the
limiting distortions implied by these functions requires coding with codewords
whose components are elements from the abstract representation alphabet. In
practice this is usually too difficult to accomplish. The rate distortion function
does, however, set theoretical limits on performance and often motivates the
RATE DISTORTION THEORY: MEMORY, GAUSSIAN SOURCES, AND UNIVERSAL CODING 503
design of more practical source encoding (data compression) schemes. The Gaus-
sian source with squared-error distortion which is presented here represents the
worst case for the commonly used squared-error criterion. This and the sub-
sequent examples are often used as standards of comparison for practical data
compression schemes.
Continuous-time Gaussian process, squared-error distortion Consider a source that
emits the zero-mean random process of T seconds duration, {u(t): 0 <t < T}. As
we stated above, our approach is to model this source as a stationary ergodic
discrete-time source with source alphabet consisting of time waveforms of dura-
tion T. Assume the energy of the output samples to be finite and choose the source
alphabet to be
$$\mathcal{U} = \left\{u(t)\colon \int_0^T u^2(t)\,dt < \infty\right\} \tag{8.4.9}$$
and the representation alphabet to be
$$\mathcal{V} = \left\{v(t)\colon \int_0^T v^2(t)\,dt < \infty\right\} \tag{8.4.10}$$
That is, our abstract alphabets are $\mathcal{U} = \mathcal{V} = L_2(T)$, the space of square-integrable
functions over the interval $0\le t\le T$, and the distortion measure
$$d_T\colon \mathcal{U}\times\mathcal{V}\to[0,\infty) \tag{8.4.11}$$
satisfies a bounded second moment condition. The rate distortion function is defined as a limit of average mutual information defined on abstract spaces $\mathcal{U}_N$
and $\mathcal{V}_N$. For stationary ergodic discrete-time sources with these alphabets, there
are coding theorems which establish that the rate distortion function does in fact
represent the minimum possible rate to achieve the given distortion.
Modeling sources which generate continuous-time random processes as
discrete-time sources is somewhat artificial since we do not assume continuity of
the random process between successive source outputs (see Berger [1971]). Rather,
we usually have a single continuous random process of long duration which we
wish to encode efficiently. Still, in our discrete-time model, by letting the signal
duration T get large, we can usually reduce the source to a memoryless vector
source with outputs of duration T. This is analogous to the arguments in the
heuristic proof of the coding theorem for stationary ergodic sources given in
Sec. 8.2. When we assume the discrete-time source is memoryless, then the rate
distortion function depends only on the single output probability measure,
namely on the space $\mathcal{U}\times\mathcal{V}$ and the distortion $d_T\colon\mathcal{U}\times\mathcal{V}\to[0,\infty)$. We denote
this rate distortion function as $R_T(D)$.
Even with the memoryless assumption, the rate distortion function $R_T(D)$ is
difficult to evaluate. The key to its evaluation is the reduction of the problem from
one involving continuous-time random processes to one involving a countable
number of random variables. A natural step is to represent the output and rep-
resentation waveforms⁷ in terms of an orthonormal basis $\{f_k(t)\}$ for $L_2(T)$ such
that
$$u(t) = \sum_{k=1}^{\infty}u^{(k)}f_k(t) \qquad 0\le t\le T \tag{8.4.12}$$
and
$$v(t) = \sum_{k=1}^{\infty}v^{(k)}f_k(t) \qquad 0\le t\le T \tag{8.4.13}$$
for any $u\in\mathcal{U}$ and $v\in\mathcal{V}$. If now the distortion measure $d_T\colon\mathcal{U}\times\mathcal{V}\to[0,\infty)$ can
be expressed in terms of the coefficients $\{u^{(k)}\}$ and $\{v^{(k)}\}$, then $R_T(D)$ is the rate
distortion function of a memoryless source with a real vector output. Earlier in
Sec. 8.1, we examined such rate distortion functions for the sum distortion
measure
$$d(\mathbf{u}, \mathbf{v}) = \sum_{k=1}^{\infty}d(u^{(k)}, v^{(k)}) \tag{8.4.14}$$
All known evaluations of $R_T(D)$ involve reduction to not only a memoryless
vector source with a sum or maximum distortion measure, but to one having
uncorrelated vector components. This can be easily accomplished by choosing the
basis $\{f_k\}$ to be the Karhunen-Loève expansion of the source output process. That
is, choose the $f_k(t)$ to be the orthonormal eigenfunctions of the integral equation
$$\int_0^T\phi(t,s)f(s)\,ds = \lambda f(t) \qquad 0\le t\le T \tag{8.4.15}$$
where $\phi(t,s) = E\{u(t)u(s)\}$ is assumed to be both positive definite and absolutely
integrable over the rectangle⁸ $0\le s,\,t\le T$.
For each normalized eigenfunction $f_k(t)$, the corresponding constant $\lambda_k$ is an
eigenvalue of $\phi(t,s)$. This choice of orthonormal basis yields the representation⁹
$$u(t) = \sum_{k=1}^{\infty}u^{(k)}f_k(t) \tag{8.4.16}$$
where
$$E\{u^{(k)}u^{(j)}\} = \lambda_k\delta_{kj} \qquad k, j = 1, 2, \ldots \tag{8.4.17}$$
The choice of distortion measure is not always clear in practice. Yet, even though
we are concerned with encoding a random process, there is no reason why we
cannot choose distortion measures that depend directly on the expansion
⁷ For source output $\{u(t)\colon 0\le t\le T\}$, this representation holds in the mean square sense uniformly
in $t\in[0, T]$.
⁸ This is a sufficient condition for the eigenfunctions $\{f_k\}$ to be complete in $L_2(T)$. However,
completeness is not necessary, for we can, without loss of generality, restrict our spaces to the space
spanned by the eigenfunctions.
⁹ Without loss of generality, we can assume $\lambda_1 \ge \lambda_2 \ge \cdots$. If $\{u^{(k)}\}$ are mutually independent, this
representation holds with probability one for each $t\in[0, T]$.
coefficients of the random process with respect to some orthonormal basis.
Indeed, practical data compression schemes essentially use this type of distortion
measure. The squared-error distortion measure lends itself naturally to such a
choice, for while $d_T\colon\mathcal{U}\times\mathcal{V}\to[0,\infty)$ is given by
$$d_T(u, v) = \frac{1}{T}\int_0^T[u(t) - v(t)]^2\,dt \tag{8.4.18}$$
it may also be expressed in terms of the Karhunen-Loève expansion coefficients
$$d_T(u, v) = \frac{1}{T}\sum_{k=1}^{\infty}(u^{(k)} - v^{(k)})^2 \tag{8.4.19}$$
The rate distortion function $R_T(D)$ is thus the rate distortion function of a memoryless vector source with uncorrelated components and a sum distortion measure. It follows from Lemma 7.7.3 and Theorem 8.1.1 that $R_T(D)$ is bounded by the
corresponding rate distortion function for the Gaussian source. Thus from (8.2.65)
and (8.2.66), we have
$$R_T(D) \le \frac{1}{T}\sum_{k=1}^{\infty}\max\left(0,\ \frac{1}{2}\ln\frac{\lambda_k}{\theta}\right) \tag{8.4.20}$$
where $\theta$ satisfies
$$D = \frac{1}{T}\sum_{k=1}^{\infty}\min(\theta, \lambda_k) \tag{8.4.21}$$
Here (8.4.20) becomes an equality if and only if the continuous-time random
process is Gaussian. Further, if we now let $T\to\infty$ and we assume the source
output process is stationary with spectral density
$$\Phi(\omega) = \int_{-\infty}^{\infty}\phi(\tau)e^{-i\omega\tau}\,d\tau \tag{8.4.22}$$
where $\phi(\tau) = E\{u(t)u(t+\tau)\}$, then, based on a continuous-time version of the Toeplitz distribution theorem (see Berger [1971], theorem 4.5.4),¹⁰ we have
$$\lim_{T\to\infty}R_T(D) \le \frac{1}{4\pi}\int_{-\infty}^{\infty}\max\left[0,\ \ln\frac{\Phi(\omega)}{\theta}\right]d\omega \tag{8.4.23}$$
where $\theta$ satisfies
$$D = \frac{1}{2\pi}\int_{-\infty}^{\infty}\min[\theta, \Phi(\omega)]\,d\omega \tag{8.4.24}$$
with equality if and only if the source output process is Gaussian.
¹⁰ This requires finite second moment, $\phi(0) < \infty$, and finite essential supremum of $\Phi(\omega)$.
Again we see that for the squared-error distortion measure the Gaussian
source statistics yield the largest rate distortion function among all stationary
processes with the same spectral density $\Phi(\omega)$. The Gaussian source rate distortion function
$$R^G(D) = \frac{1}{4\pi}\int_{-\infty}^{\infty}\max\left[0,\ \ln\frac{\Phi(\omega)}{\theta}\right]d\omega \tag{8.4.25}$$
where $\theta$ satisfies (8.4.24), often serves as a basis for comparing various practical
data compression schemes.
Example (Band-limited Gaussian source) An ideal band-limited Gaussian source with constant
spectral density
$$\Phi(\omega) = \begin{cases}\dfrac{\sigma^2}{2W} & |\omega| \le 2\pi W\\[4pt] 0 & |\omega| > 2\pi W\end{cases} \tag{8.4.26}$$
yields the rate distortion function
$$R(D) = W\ln\frac{\sigma^2}{D} \qquad 0\le D\le\sigma^2 \tag{8.4.27}$$
This is Shannon’s [1948] classical formula. It is easy to see that this is also the rate distortion
function for any stationary Gaussian source of average power $\sigma^2$ whose spectral density is flat
over any set of radian frequencies of total measure $W$.
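As an added numerical check on the example, the Python sketch below evaluates (8.4.24) and (8.4.25) on a frequency grid for the ideal band-limited spectrum of (8.4.26) and compares the result with Shannon's formula (8.4.27); the bandwidth and power values are arbitrary assumptions.

```python
import numpy as np

sigma2, W = 1.0, 1000.0                      # source power and bandwidth in Hz (assumed)
w = np.linspace(-2 * np.pi * W, 2 * np.pi * W, 400001)
Phi = np.full_like(w, sigma2 / (2 * W))      # flat spectrum of (8.4.26)

for D_target in (0.5, 0.1, 0.01):
    # For the flat spectrum, theta = D/(2W) puts (8.4.24) at the target distortion.
    theta = D_target / (2 * W)
    D = np.trapz(np.minimum(theta, Phi), w) / (2 * np.pi)
    R = np.trapz(np.maximum(0.0, np.log(Phi / theta)), w) / (4 * np.pi)
    R_shannon = W * np.log(sigma2 / D)       # Shannon's formula (8.4.27), nats/s
    print(f"D = {D:.4f}   R(D) = {R:.1f} nats/s   W ln(sigma^2/D) = {R_shannon:.1f} nats/s")
```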
Gaussian images, squared-error distortion Information sources that produce pic-
tures (two-dimensional images) may be modeled as discrete-time sources with
outputs that are two-dimensional random fields represented by
$$\left\{u(x, y)\colon |x| \le \frac{L}{2},\ |y| \le \frac{L}{2}\right\} \tag{8.4.28}$$
Images are usually described by the nonnegative image intensity function $\{i(x, y)\colon$
$|x| \le L/2,\ |y| \le L/2\}$. We assume that the source output is $u(x, y) = \ln i(x, y)$,
which is modeled here as a zero-mean Gaussian random field. In addition, if
$u(x, y)$ and $v(x, y)$ are any two-dimensional functions, we define the distortion
measure to be
$$d_L(u, v) = \frac{1}{L^2}\int_{-L/2}^{L/2}\int_{-L/2}^{L/2}[u(x, y) - v(x, y)]^2\,dx\,dy \tag{8.4.29}$$
The fact that we encode u(x, y) = In i(x, y), the log of the intensity function, with a
mean square criterion may appear somewhat artificial. There is, however,
evidence (see Campbell and Robson [1968] and Van Ness and Bouman [1965]) that
an observer’s ability to determine the difference between two field intensities
corresponds to the difference between corresponding transformed fields of the
logarithm of the intensities.
Thus, for sources that produce two-dimensional images, we model our source
as a discrete-time source that outputs a zero-mean Gaussian random field. The
abstract source and representation alphabets are assumed to be
$$\mathcal{U} = \mathcal{V} = \left\{u(x, y)\colon \int_{-L/2}^{L/2}\int_{-L/2}^{L/2}u^2(x, y)\,dx\,dy < \infty\right\} \tag{8.4.30}$$
and we choose the squared-error distortion measure given by (8.4.29). If we
assume the discrete-time source is stationary and ergodic, then a rate distortion
function can be defined which represents the smallest rate achievable for a given
average distortion. First assume that the discrete-time source is memoryless. This
means that successive output images of the source are independent and the rate
distortion function $R_L(D)$ depends only on the probability measures on $\mathcal{U}\times\mathcal{V}$
and the single output distortion measure given in (8.4.29).
For the memoryless case, evaluation of $R_L(D)$ is the natural generalization of
the continuous-time problem given above. We begin by defining the autocorrelation function of the zero-mean Gaussian random field as
$$\phi(x, y; x', y') = E\{u(x, y)u(x', y')\} \tag{8.4.31}$$
To be able to evaluate $R_L(D)$, we again require a representation of source outputs
in terms of a countable number of independent random variables, and again we
attempt to express our distortion measure in terms of these random variables.
With the squared-error distortion measure, any orthonormal expansion of the
source output random field will suffice. To have independent components,
however, we need the Karhunen-—Loéve expansion. We express outputs as
$$u(x, y) = \sum_{k=1}^{\infty}u^{(k)}f_k(x, y) \qquad |x| \le \frac{L}{2},\ |y| \le \frac{L}{2} \tag{8.4.32}$$
where
$$u^{(k)} = \int_{-L/2}^{L/2}\int_{-L/2}^{L/2}u(x, y)f_k(x, y)\,dx\,dy \tag{8.4.33}$$
and $\{f_k(x, y)\}$ are orthonormal functions (eigenfunctions) that are solutions to the
integral equation
$$\lambda f(x, y) = \int_{-L/2}^{L/2}\int_{-L/2}^{L/2}\phi(x, y; x', y')f(x', y')\,dx'\,dy' \tag{8.4.34}$$
For each eigenfunction $f_k(x, y)$, the corresponding eigenvalue $\lambda_k$ is nonnegative
and satisfies the condition¹¹
$$E\{u^{(k)}u^{(j)}\} = \lambda_k\delta_{kj} \qquad k, j = 1, 2, \ldots \tag{8.4.35}$$
¹¹ Again we assume $\lambda_1 \ge \lambda_2 \ge \cdots$. This representation holds with probability one for every
$x, y\in[-L/2, L/2]$.
As for the one-dimensional case, we assume that the autocorrelation $\phi(x, y; x', y')$
satisfies the conditions necessary to insure that the eigenfunctions $\{f_k\}$ span the
alphabet space $\mathcal{U} = \mathcal{V}$. Thus for any two functions in $\mathcal{U} = \mathcal{V}$, we have
$$u(x, y) = \sum_{k=1}^{\infty}u^{(k)}f_k(x, y) \tag{8.4.36}$$
$$v(x, y) = \sum_{k=1}^{\infty}v^{(k)}f_k(x, y) \tag{8.4.37}$$
and the distortion measure becomes
$$d_L(u, v) = \frac{1}{L^2}\sum_{k=1}^{\infty}(u^{(k)} - v^{(k)})^2 \tag{8.4.38}$$
For this sum distortion measure, $R_L(D)$ is now expressed in terms of a memoryless
vector source with output $\mathbf{u} = \{u^{(1)}, u^{(2)}, \ldots\}$ whose components are independent
Gaussian random variables, with the variance of $u^{(k)}$ given by $\lambda_k$ for each $k$. The
rate distortion function of the random field normalized to unit area is thus (see
Sec. 8.2)
$$R_L(D) = \frac{1}{L^2}\sum_{k=1}^{\infty}\max\left(0,\ \frac{1}{2}\ln\frac{\lambda_k}{\theta}\right) \tag{8.4.39}$$
where $\theta$ satisfies
$$D = \frac{1}{L^2}\sum_{k=1}^{\infty}\min(\theta, \lambda_k) \tag{8.4.40}$$
Here $R_L(D)$ represents the minimum rate in nats per unit area required to encode
the source with average distortion $D$ or less.
Since eigenvalues are difficult to evaluate, $R_L(D)$ given in this form is not very
useful. We now take the limit as $L$ goes to infinity. Defining
$$R(D) = \lim_{L\to\infty}R_L(D) \tag{8.4.41}$$
we observe that $R(D)$ represents the minimum rate over all choices of $L$ and thus
the minimum achievable rate per unit area. In addition, since for most images $L$ is
large compared to correlation distances, letting $L$ approach infinity is a good
approximation. To evaluate this limit we must now restrict our attention to
homogeneous random fields where we have
$$\phi(x, y; x', y') = \phi(x - x',\ y - y') \tag{8.4.42}$$
This is the two-dimensional stationarity condition and allows us to define a
two-dimensional spectral density function
$$\Phi(\omega_x, \omega_y) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\phi(r_x, r_y)e^{-i(\omega_x r_x + \omega_y r_y)}\,dr_x\,dr_y \tag{8.4.43}$$
Sakrison [1969] has derived a two-dimensional version of the Toeplitz distribution theorem which allows us to evaluate the asymptotic distribution of the
eigenvalues of (8.4.34). This theorem shows that for any continuous function $G(\lambda)$
$$\lim_{L\to\infty}\frac{1}{L^2}\sum_{k}G(\lambda_k) = \frac{1}{(2\pi)^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}G[\Phi(\omega_x, \omega_y)]\,d\omega_x\,d\omega_y \tag{8.4.44}$$
Applying this theorem to (8.4.39) and (8.4.40) yields
$$R(D) = \lim_{L\to\infty}R_L(D) = \frac{1}{(2\pi)^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\max\left[0,\ \frac{1}{2}\ln\frac{\Phi(\omega_x, \omega_y)}{\theta}\right]d\omega_x\,d\omega_y \tag{8.4.45}$$
where $\theta$ satisfies
$$D = \frac{1}{(2\pi)^2}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\min[\theta, \Phi(\omega_x, \omega_y)]\,d\omega_x\,d\omega_y \tag{8.4.46}$$
As with our one-dimensional case, $R(D)$ is an upper bound to all other rate
distortion functions of non-Gaussian memoryless sources with the same spectral
density $\Phi(\omega_x, \omega_y)$, and thus serves as a basis for comparison for various image
compression schemes.
Example (Isotropic field) An isotropic field has a correlation function which depends only on the
total distance between two points in the two-dimensional space. That is,
$$\phi(r_x, r_y) = \phi\left(\sqrt{r_x^2 + r_y^2}\right) \tag{8.4.47}$$
By defining $r$, $\theta_r$, $\omega$, and $\theta_\omega$ as polar coordinates where
$$r_x = r\cos\theta_r \qquad r_y = r\sin\theta_r \tag{8.4.48}$$
and
$$\omega_x = \omega\cos\theta_\omega \qquad \omega_y = \omega\sin\theta_\omega \tag{8.4.49}$$
we obtain
$$\begin{aligned}
\Phi(\omega, \theta_\omega) &= \Phi(\omega_x, \omega_y)\\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\phi(r_x, r_y)e^{-i\omega(r_x\cos\theta_\omega + r_y\sin\theta_\omega)}\,dr_x\,dr_y\\
&= \int_0^{\infty}\int_0^{2\pi}\phi(r)e^{-i\omega r(\cos\theta_r\cos\theta_\omega + \sin\theta_r\sin\theta_\omega)}r\,d\theta_r\,dr\\
&= \int_0^{\infty}\phi(r)\left[\int_0^{2\pi}e^{-i\omega r\cos(\theta_r - \theta_\omega)}\,d\theta_r\right]r\,dr\\
&= 2\pi\int_0^{\infty}\phi(r)J_0(\omega r)r\,dr
\end{aligned} \tag{8.4.50}$$
where $J_0(\cdot)$ is the zeroth order Bessel function of the first kind. Since there is no $\theta_\omega$ dependence,
$$\Phi(\omega) = 2\pi\int_0^{\infty}r\phi(r)J_0(\omega r)\,dr \tag{8.4.51}$$
where
$$\Phi(\omega) \equiv \Phi(\omega_x, \omega_y) \qquad \phi(r) \equiv \phi(r_x, r_y) \qquad \omega = \sqrt{\omega_x^2 + \omega_y^2} \qquad r = \sqrt{r_x^2 + r_y^2} \tag{8.4.52}$$
$\Phi(\omega)$ and $\phi(r)$ are related by the Hankel transform of zero order.
For television images, a reasonably satisfactory power spectral density is
$$\Phi(\omega) = \frac{2\pi\omega_0}{(\omega_0^2 + \omega^2)^{3/2}} \tag{8.4.53}$$
resulting in
$$\phi(r) = e^{-r/d_c} \tag{8.4.54}$$
where $d_c = 1/\omega_0$ is the coherence distance of the field (Sakrison and Algazi [1971]).
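For an isotropic field the double integrals in (8.4.45) and (8.4.46) reduce to single radial integrals, since $d\omega_x\,d\omega_y = \omega\,d\omega\,d\theta_\omega$. The Python sketch below traces out a few $(D, R(D))$ points by sweeping the water level $\theta$; it assumes the spectral shape of (8.4.53) with an arbitrary value of $\omega_0$, so the numbers are illustrative only.

```python
import numpy as np

w0 = 1.0                                        # reciprocal coherence distance (assumed value)
w = np.linspace(0.0, 200.0 * w0, 400001)        # radial frequency grid
Phi = 2 * np.pi * w0 / (w0**2 + w**2) ** 1.5    # assumed isotropic spectrum, cf. (8.4.53)

def rd_point(theta):
    # Radial reduction of (8.4.45)-(8.4.46): (1/(2*pi)^2) * 2*pi * integral(... * w dw).
    D = np.trapz(np.minimum(theta, Phi) * w, w) / (2 * np.pi)
    R = np.trapz(np.maximum(0.0, 0.5 * np.log(Phi / theta)) * w, w) / (2 * np.pi)
    return D, R

print("   D (per unit area)    R(D) (nats per unit area)")
for theta in (1e-1, 1e-2, 1e-3, 1e-4):
    D, R = rd_point(theta)
    print(f"   {D:.4f}              {R:.3f}")
```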
For many sources successive images are often highly correlated so that the
above memoryless assumption is unrealistic. We now find an upper bound to the
rate distortion function of a discrete-time stationary ergodic source that emits the
two-dimensional homogeneous Gaussian random field described above. Let the
$n$th output be denoted
$$\left\{u_n(x, y)\colon |x| \le \frac{L}{2},\ |y| \le \frac{L}{2}\right\} \tag{8.4.55}$$
Again use the usual Karhunen-Loève expansion
$$u_n(x, y) = \sum_{k=1}^{\infty}u_n^{(k)}f_k(x, y) \tag{8.4.56}$$
where $\{f_k(\cdot,\cdot)\}$ and $\{\lambda_k\}$ are eigenfunctions and eigenvalues which satisfy the integral equation of (8.4.34). By the assumed stationarity of the discrete-time source
with memory, the autocorrelation of the random field $\phi(x, y; x', y')$ is independent of the output time index $n$, and hence eigenfunctions and eigenvalues
are the same for each output of the discrete-time stationary ergodic source. We
now have a source that outputs a vector $\mathbf{u}_n = (u_n^{(1)}, u_n^{(2)}, \ldots)$ at the $n$th time.
The rate distortion function of the discrete-time stationary ergodic source is
given by
$$R_L(D) = \lim_{N\to\infty}R_{L,N}(D) \tag{8.4.57}$$
where $R_{L,N}(D)$ is the $N$th-order rate distortion function [i.e., which uses only the
first $N$ terms in the expansion (8.4.56)]. We can upper-bound $R_L(D)$ by the rate
required with any particular encoding scheme that achieves average distortion $D$.
Consider the following scheme:

1. Encode each Karhunen-Loève expansion coefficient independently of the other coefficients.¹² That is, regard the $k$th coefficient sequence $\{u_1^{(k)}, u_2^{(k)}, \ldots\}$ as the output of a zero-mean Gaussian subsource and encode it with respect to a squared-error distortion measure with average distortion $D^{(k)}$.
2. Choose the distortions $D^{(1)}, D^{(2)}, \ldots$ so as to achieve an overall average distortion $D$.

¹² This amounts to partitioning the source into its spatial spectral components and treating successive (in time) samples of a given component as a subsource which is to be encoded independently of all other component subsources.
The required rate for the above scheme, which we now proceed to evaluate,
will certainly upper-bound $R_L(D)$. Let us define correlation functions for each
subsource
$$\phi^{(k)}(\tau) = E\{u_n^{(k)}u_{n+\tau}^{(k)}\} \tag{8.4.58}$$
and corresponding spectral density functions
$$\Phi^{(k)}(\omega) = \sum_{\tau=-\infty}^{\infty}\phi^{(k)}(\tau)e^{-i\tau\omega} \tag{8.4.59}$$
Consider encoding the sequence $\{u_1^{(k)}, u_2^{(k)}, \ldots\}$ with respect to the squared-error
distortion measure. From (8.2.69) and (8.2.70), we see that for distortion $D^{(k)}$ the
required rate is
$$R^{(k)}(D^{(k)}) = \frac{1}{4\pi}\int_{-\pi}^{\pi}\max\left[0,\ \ln\frac{\Phi^{(k)}(\omega)}{\theta}\right]d\omega \tag{8.4.60}$$
where $\theta$ satisfies
$$D^{(k)} = \frac{1}{2\pi}\int_{-\pi}^{\pi}\min[\theta, \Phi^{(k)}(\omega)]\,d\omega \tag{8.4.61}$$
Here $R^{(k)}(D^{(k)})$ is in nats per output of the subsource.
Recall that the total distortion measure is
$$d_L(\mathbf{u}, \mathbf{v}) = \frac{1}{L^2}\sum_{k=1}^{\infty}(u^{(k)} - v^{(k)})^2 \tag{8.4.62}$$
Hence, choosing $\{D^{(k)}\}$ such that
$$D = \frac{1}{L^2}\sum_{k=1}^{\infty}D^{(k)} \tag{8.4.63}$$
will achieve average distortion $D$. The total rate per unit area is given by
$$\frac{1}{L^2}\sum_{k=1}^{\infty}R^{(k)}(D^{(k)}) \tag{8.4.64}$$
Thus we have
$$R_L(D) \le \frac{1}{L^2}\sum_{k=1}^{\infty}\frac{1}{4\pi}\int_{-\pi}^{\pi}\max\left[0,\ \ln\frac{\Phi^{(k)}(\omega)}{\theta}\right]d\omega \tag{8.4.65}$$
where now we choose $\theta$ to satisfy
$$D = \frac{1}{L^2}\sum_{k=1}^{\infty}\frac{1}{2\pi}\int_{-\pi}^{\pi}\min[\theta, \Phi^{(k)}(\omega)]\,d\omega \tag{8.4.66}$$
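In (8.4.65) and (8.4.66) a single water level $\theta$ is shared by all subsources. As a brief sketch, the Python code below finds $\theta$ for a target overall distortion by bisection and evaluates the resulting rate bound; it assumes, purely for illustration, subsource spectra of the separable form $\lambda_k\Psi(\omega)$ that appears in the example which follows, with arbitrary numerical values.

```python
import numpy as np

# Assumed example: subsource spectra Phi_k(w) = lam_k * Psi(w), with arbitrary
# spatial eigenvalues lam_k and an AR(1)-type temporal spectrum Psi(w).
L = 8.0
lam = 10.0 * 0.5 ** np.arange(40)                  # assumed eigenvalues lambda_k
w = np.linspace(-np.pi, np.pi, 20001)
Psi = 1.0 / (1.0 - 2 * 0.8 * np.cos(w) + 0.8**2)
Psi /= np.trapz(Psi, w) / (2 * np.pi)              # normalize so psi(0) = 1

def distortion(theta):      # right side of (8.4.66)
    return sum(np.trapz(np.minimum(theta, lk * Psi), w) / (2 * np.pi) for lk in lam) / L**2

def rate(theta):            # right side of (8.4.65)
    return sum(np.trapz(np.maximum(0.0, np.log(lk * Psi / theta)), w) / (4 * np.pi)
               for lk in lam) / L**2

D_target = 0.05
lo, hi = 1e-9, float(lam.max() * Psi.max())
for _ in range(100):                               # bisection on the water level theta
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if distortion(mid) < D_target else (lo, mid)
theta = 0.5 * (lo + hi)
print(f"theta = {theta:.5f}   D = {distortion(theta):.4f}   "
      f"R_L(D) <= {rate(theta):.4f} nats per unit area")
```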
We consider next a special case for which this upper bound is tight.
Example (Separation of correlation) Suppose the time and spatial correlation of source outputs
separate as follows:
$$E\{u_n(x, y)u_{n+\tau}(x', y')\} = \psi(\tau)\phi(x - x', y - y') \tag{8.4.67}$$
where $\psi(0) = 1$.
Recall that any two Karhunen-Loève expansion coefficients $u_n^{(k)}$ and $u_{n+\tau}^{(j)}$ are given by
$$u_n^{(k)} = \int_{-L/2}^{L/2}\int_{-L/2}^{L/2}u_n(x, y)f_k(x, y)\,dx\,dy$$
$$u_{n+\tau}^{(j)} = \int_{-L/2}^{L/2}\int_{-L/2}^{L/2}u_{n+\tau}(x, y)f_j(x, y)\,dx\,dy \tag{8.4.68}$$
Thus we have correlation
$$\begin{aligned}
E\{u_n^{(k)}u_{n+\tau}^{(j)}\} &= \int_{-L/2}^{L/2}\int_{-L/2}^{L/2}\int_{-L/2}^{L/2}\int_{-L/2}^{L/2}E\{u_n(x, y)u_{n+\tau}(x', y')\}f_k(x, y)f_j(x', y')\,dx\,dy\,dx'\,dy'\\
&= \int_{-L/2}^{L/2}\int_{-L/2}^{L/2}\int_{-L/2}^{L/2}\int_{-L/2}^{L/2}\psi(\tau)\phi(x - x', y - y')f_k(x, y)f_j(x', y')\,dx\,dy\,dx'\,dy'\\
&= \psi(\tau)\int_{-L/2}^{L/2}\int_{-L/2}^{L/2}\lambda_k f_k(x', y')f_j(x', y')\,dx'\,dy'\\
&= \lambda_k\psi(\tau)\delta_{kj}
\end{aligned} \tag{8.4.69}$$
Hence
$$\phi^{(k)}(\tau) = E\{u_n^{(k)}u_{n+\tau}^{(k)}\} = \lambda_k\psi(\tau) \tag{8.4.70}$$
and for any $k\ne j$
$$E\{u_n^{(k)}u_{n+\tau}^{(j)}\} = 0 \qquad \text{for all } \tau \tag{8.4.71}$$
Since we have Gaussian statistics, the uncorrelated random variables are independent random
variables and the different Karhunen—Loéve expansion coefficient sequences can be regarded as
independent subsources. Lemma 8.1.1 shows that the upper bound given in (8.4.65) is in fact
exact, and we have for this case
$$R_L(D) = \frac{1}{L^2}\sum_{k=1}^{\infty}\frac{1}{4\pi}\int_{-\pi}^{\pi}\max\left[0,\ \ln\frac{\lambda_k\Psi(\omega)}{\theta}\right]d\omega \tag{8.4.72}$$
where $\theta$ is chosen to satisfy
$$D = \frac{1}{L^2}\sum_{k=1}^{\infty}\frac{1}{2\pi}\int_{-\pi}^{\pi}\min[\theta, \lambda_k\Psi(\omega)]\,d\omega \tag{8.4.73}$$
Using (8.4.44) in taking the limit as $L\to\infty$, we have the limiting rate distortion function given by
$$R(D) = \lim_{L\to\infty}R_L(D) = \frac{1}{2(2\pi)^3}\int_{-\pi}^{\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\max\left[0,\ \ln\frac{\Phi(\omega_x, \omega_y)\Psi(\omega)}{\theta}\right]d\omega_x\,d\omega_y\,d\omega \tag{8.4.74}$$
where $\theta$ satisfies
$$D = \frac{1}{(2\pi)^3}\int_{-\pi}^{\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\min[\theta, \Phi(\omega_x, \omega_y)\Psi(\omega)]\,d\omega_x\,d\omega_y\,d\omega \tag{8.4.75}$$
and where $\Phi(\omega_x, \omega_y)$ is given by (8.4.43) and
$$\Psi(\omega) = \sum_{\tau=-\infty}^{\infty}\psi(\tau)e^{-i\tau\omega}$$
This example shows that the particular scheme of encoding expansion coefficients
independently of one another is an optimum encoding scheme when the time and
spatial correlations are separated as in (8.4.67). This general idea of taking a
complex source and decomposing it into independent subsources, which are
encoded separately, is a basic design approach for practical data compression
schemes.
8.5 SYMMETRIC SOURCES WITH BALANCED DISTORTION
MEASURES AND FIXED COMPOSITION SEQUENCES
In Sec. 7.6 we found that for symmetric sources with balanced distortion meas-
ures, the rate distortion functions are easily obtained in closed parametric form
[see (7.6.69) and (7.6.70)]. We now show that these symmetric sources with bal-
anced distortion measures have the property that, for fixed rate arbitrarily close
to R(D) and sufficiently large block lengths, there exist codes that encode every
source output sequence with distortion D or less. This is a considerably stronger
result than that stated in Theorem 7.2.1 which shows this only for the average
distortion. A similar strong result holds for the encoding of sequences of fixed
composition of an arbitrary discrete source and this will lead us in the next section
to the notion of robust source coding techniques that are independent of source
statistics. We begin by restating the definition of symmetric sources and balanced
distortion measures.
8.5.1 Symmetric Sources with Balanced Distortion Measures
A symmetric source is a discrete memoryless source with equally likely output
letters. That is,
$$\mathcal{U} = \{a_1, a_2, \ldots, a_A\} \tag{8.5.1}$$
where
$$Q(a_k) = \frac{1}{A} \qquad k = 1, 2, \ldots, A \tag{8.5.2}$$
Assuming the same number of representation letters as source letters, where
$\mathcal{V} = \{b_1, b_2, \ldots, b_A\}$, for a balanced distortion measure there exist nonnegative
numbers $\{d_1, d_2, \ldots, d_A\}$ such that
$$\{d(u, b_1), d(u, b_2), \ldots, d(u, b_A)\} = \{d_1, d_2, \ldots, d_A\} \qquad \text{for all } u\in\mathcal{U}$$
and
$$\{d(a_1, v), d(a_2, v), \ldots, d(a_A, v)\} = \{d_1, d_2, \ldots, d_A\} \qquad \text{for all } v\in\mathcal{V} \tag{8.5.3}$$
The rate distortion function $R(D)$ is given parametrically by
$$D = D_s = \frac{\displaystyle\sum_{k=1}^{A}d_k e^{sd_k}}{\displaystyle\sum_{k=1}^{A}e^{sd_k}} \tag{7.6.69}$$
$$R(D_s) = sD_s + \ln A - \ln\left(\sum_{k=1}^{A}e^{sd_k}\right) \tag{7.6.70}$$
where s < 0 is the independent parameter.
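The parametric form (7.6.69) and (7.6.70) is easy to evaluate directly. The Python sketch below sweeps the parameter $s \le 0$ for an assumed balanced distortion set $\{d_k\}$ and prints a few $(D_s, R(D_s))$ pairs; the choice $\{d_k\} = \{0, 1, 1, 1\}$ with $A = 4$ is just one example, corresponding to the error distortion measure on a four-letter symmetric source.

```python
import numpy as np

def symmetric_rd(d, s_values):
    """(D_s, R(D_s)) from the parametric equations (7.6.69)-(7.6.70)
    for a symmetric source with balanced distortion values d_1..d_A."""
    d = np.asarray(d, dtype=float)
    A = len(d)
    points = []
    for s in s_values:
        e = np.exp(s * d)
        D = float(np.sum(d * e) / np.sum(e))
        R = s * D + np.log(A) - np.log(np.sum(e))
        points.append((D, R))
    return points

# Assumed example: A = 4 with the error distortion measure, d_k in {0, 1, 1, 1}.
for D, R in symmetric_rd([0.0, 1.0, 1.0, 1.0], s_values=[-0.5, -1.0, -2.0, -4.0]):
    print(f"D = {D:.4f}   R(D) = {R:.4f} nats")
```

As $s \to 0^-$ the distortion approaches its maximum and the rate approaches zero; as $s \to -\infty$ the pair approaches $(D_{\min}, \ln A)$ minus the entropy of the minimizing letters, tracing the whole curve.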
Consider again the block source encoding and decoding system of Fig. 7.3. As
we did earlier, we prove a coding theorem by considering an ensemble of block
codes of size M and block length N. By symmetry in this ensemble, we choose
code $\mathscr{B} = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M\}$ with uniform probability distribution, that is,
$$P(\mathscr{B}) = \left(\frac{1}{A}\right)^{MN} \tag{8.5.4}$$
Here each code letter is chosen independently of other code letters and with a
uniform one-dimensional probability distribution. Furthermore, since the distortion matrix is balanced, for fixed $u\in\mathcal{U}$, the random variable $d(u, v)$ is independent
of $u$. That is, for any $u\in\mathcal{U}$
$$\Pr\{d(u, v) = d_k\,|\,u\} = \frac{1}{A} \qquad k = 1, 2, \ldots, A \tag{8.5.5}$$
This means that for any fidelity criterion $D$ and any two source sequences
$\mathbf{u}, \mathbf{u}'\in\mathcal{U}_N$ we have
$$\Pr\{d_N(\mathbf{u}, \mathbf{v}) > D\,|\,\mathbf{u}\} = \Pr\{d_N(\mathbf{u}', \mathbf{v}) > D\,|\,\mathbf{u}'\} \tag{8.5.6}$$
This is the key property of symmetric sources with balanced distortion measures
which we now exploit.
Lemma 8.5.1 Given block length $N$, distortion level $D > D_{\min}$, and any
source output sequence $\mathbf{u}\in\mathcal{U}_N$, over the ensemble of codes $\mathscr{B}$ of block length
$N$ and rate $R > R(D)$
$$\Pr\{d(\mathbf{u}\,|\,\mathscr{B}) > D\,|\,\mathbf{u}\} \le e^{-MF_N(D)} \le \exp\{-e^{N[R - R(D) + o(N)]}\} \tag{8.5.7}$$
where $o(N)\to 0$ as $N\to\infty$ and
$$F_N(D) = \Pr\{d_N(\mathbf{u}, \mathbf{v}) \le D\,|\,\mathbf{u}\} \tag{8.5.8}$$
is independent of $\mathbf{u}\in\mathcal{U}_N$.
Proof Let $\mathscr{B} = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M\}$. Then since codewords are independent and
identically distributed, according to (8.5.6)
$$\begin{aligned}
\Pr\{d(\mathbf{u}\,|\,\mathscr{B}) > D\,|\,\mathbf{u}\} &= \Pr\left\{\min_{m}d_N(\mathbf{u}, \mathbf{v}_m) > D\,\Big|\,\mathbf{u}\right\}\\
&= \Pr\{d_N(\mathbf{u}, \mathbf{v}_m) > D;\ m = 1, 2, \ldots, M\,|\,\mathbf{u}\}\\
&= \prod_{m=1}^{M}\Pr\{d_N(\mathbf{u}, \mathbf{v}_m) > D\,|\,\mathbf{u}\}\\
&= [1 - F_N(D)]^M\\
&\le e^{-MF_N(D)}
\end{aligned} \tag{8.5.9}$$
where the inequality follows from $\ln x \le x - 1$.
Next note that, for fixed $\mathbf{u}\in\mathcal{U}_N$,
$$d_N(\mathbf{u}, \mathbf{v}) = \frac{1}{N}\sum_{n=1}^{N}d(u_n, v_n) \tag{8.5.10}$$
is a normalized sum of independent identically distributed random variables.
In App. 8A we apply the Chernoff bounding technique to obtain for any $\epsilon > 0$
$$F_N(D) \ge \left(1 - \frac{1}{N\epsilon^2 s^2}\right)e^{-N[R(D - \epsilon) + |s|\epsilon]} \tag{8.5.11}$$
where $s\le 0$ satisfies
$$D - \epsilon = \frac{\displaystyle\sum_{k=1}^{A}d_k e^{sd_k}}{\displaystyle\sum_{k=1}^{A}e^{sd_k}} \tag{8.5.12}$$
We assume $D > D_{\min}$ and choose $\epsilon > 0$ small enough so that $D - \epsilon > D_{\min}$.
This guarantees that $s$ is finite and converges to a finite limit as $\epsilon\to 0$. In
particular, choosing
$$\epsilon = \sqrt{\frac{2}{Ns^2}} \tag{8.5.13}$$
we have
$$F_N(D) \ge e^{-N[R(D) + o(N)]} \tag{8.5.14}$$
From this lemma it follows immediately that the ensemble average distortion satisfies
$$\overline{d(\mathscr{B})} \le D + d_{\max}\exp\{-e^{N[R - R(D) + o(N)]}\} \tag{8.5.15}$$
where $d_{\max} = \max_k d_k$, and hence that there exists a code $\mathscr{B}$ for which $d(\mathscr{B})$ also satisfies this bound.
Comparing this with Theorem 7.2.1, we see that this lemma is a stronger result
since the second term here is decreasing at a double exponential rate with block
length N, compared to the single exponential rate of Theorem 7.2.1. Another
observation is that Lemma 8.5.1 holds regardless of the source probability distri-
bution and is true even for sources with memory. This happens since we have a
balanced distortion matrix and assume a uniform distribution on the code
ensemble. Of course, when the source output probability distribution is not uni-
form, we cannot say that the R(D) of the symmetric source is the rate distortion
function. It is clear, however, that the rate distortion function of the symmetric
source, R(D), is an upper bound to the rate distortion functions of all other
sources with the same balanced distortion, since we can always achieve distortion
arbitrarily close to D with rate arbitrarily close to R(D). We consider this in
greater detail when we examine the problem of encoding source sequences of fixed
composition. We next prove the source coding theorem for symmetric sources
with balanced distortion measures.
Theorem 8.5.1 For a symmetric source with a balanced distortion measure
and any rate $R$ where $R > R(D)$, there exists a block code $\mathscr{B}$ of sufficiently
large block length $N$ and rate $R$ such that
$$d(\mathbf{u}\,|\,\mathscr{B}) \le D \qquad \text{for all } \mathbf{u}\in\mathcal{U}_N \tag{8.5.16}$$
Proof For any code $\mathscr{B}$ of block length $N$ and rate $R$, define the indicator
function
$$\Phi(\mathbf{u}\,|\,\mathscr{B}) = \begin{cases}1 & d(\mathbf{u}\,|\,\mathscr{B}) > D\\ 0 & d(\mathbf{u}\,|\,\mathscr{B}) \le D\end{cases} \tag{8.5.17}$$
for $\mathbf{u}\in\mathcal{U}_N$. Averaging $\Phi$ over source output sequences gives
$$\sum_{\mathbf{u}}Q_N(\mathbf{u})\Phi(\mathbf{u}\,|\,\mathscr{B}) = \frac{1}{A^N}\sum_{\mathbf{u}}\Phi(\mathbf{u}\,|\,\mathscr{B}) \tag{8.5.18}$$
Averaging this over the ensemble of codes yields
$$\begin{aligned}
\frac{1}{A^N}\sum_{\mathbf{u}}\overline{\Phi(\mathbf{u}\,|\,\mathscr{B})} &= \sum_{\mathbf{u}}Q_N(\mathbf{u})\sum_{\mathscr{B}}P(\mathscr{B})\Phi(\mathbf{u}\,|\,\mathscr{B})\\
&= \sum_{\mathbf{u}}Q_N(\mathbf{u})\Pr\{d(\mathbf{u}\,|\,\mathscr{B}) > D\,|\,\mathbf{u}\}\\
&\le \exp\{-e^{N[R - R(D) + o(N)]}\}
\end{aligned} \tag{8.5.19}$$
where the inequality follows from Lemma 8.5.1. This means there exists at
least one code $\mathscr{B}$ for which
$$\frac{1}{A^N}\sum_{\mathbf{u}}\Phi(\mathbf{u}\,|\,\mathscr{B}) \le \frac{1}{A^N}\sum_{\mathbf{u}}\overline{\Phi(\mathbf{u}\,|\,\mathscr{B})} \le \exp\{-e^{N[R - R(D) + o(N)]}\}$$
or
$$\sum_{\mathbf{u}}\Phi(\mathbf{u}\,|\,\mathscr{B}) \le A^N\exp\{-e^{N[R - R(D) + o(N)]}\} \tag{8.5.20}$$
The bound can be made less than 1 by choosing $N$ large enough when
$R > R(D)$. Then we have
$$\sum_{\mathbf{u}}\Phi(\mathbf{u}\,|\,\mathscr{B}) < 1 \tag{8.5.21}$$
But by the definition (8.5.17), for each $\mathbf{u}$, $\Phi(\mathbf{u}\,|\,\mathscr{B})$ can only be 0 or 1. Hence
(8.5.21) implies that $\Phi(\mathbf{u}\,|\,\mathscr{B}) = 0$ for all $\mathbf{u}\in\mathcal{U}_N$, which requires $d(\mathbf{u}\,|\,\mathscr{B}) \le D$
for all $\mathbf{u}$.
Since (8.5.16) holds for all output sequences, we see that this theorem holds
for any source distribution $\{Q_N(\mathbf{u})\colon \mathbf{u}\in\mathcal{U}_N\}$ when $R(D)$ is the symmetric source
rate distortion function and R > R(D). For any other source distribution, the
actual rate distortion function will be less than that of the uniform distribution.
8.5.2 Fixed-Composition Sequences—Binary Alphabet Example
There is a close relationship between symmetric sources with balanced distortions
and fixed-composition source output sequences of an arbitrary discrete source.
For sequences of fixed composition, we can prove a theorem analogous to
Theorem 8.5.1. Although this property is easily generalizable to arbitrary discrete
alphabet sources with a bounded single-letter distortion measure (see Martin
[1976]), we demonstrate the results for the binary source alphabet and error
distortion measure.
Suppose we have a source alphabet $\mathcal{U} = \{0, 1\}$, a representation alphabet
$\mathcal{V} = \{0, 1\}$, and error distortion measure
$$d(k, j) = 1 - \delta_{kj} \qquad \text{for } k, j = 0, 1 \tag{8.5.22}$$
For $\mathbf{u}\in\mathcal{U}_N$, define its weight as $w(\mathbf{u}) = $ number of 1s in $\mathbf{u}$, and define the composition classes
$$\mathscr{C}_N(l) = \{\mathbf{u}\colon \mathbf{u}\in\mathcal{U}_N,\ w(\mathbf{u}) = l\} \qquad l = 0, 1, \ldots, N \tag{8.5.23}$$
with probabilities
$$Q^{(l)}(u) = \begin{cases}\dfrac{l}{N} & u = 1\\[4pt] 1 - \dfrac{l}{N} & u = 0\end{cases} \tag{8.5.24}$$
and corresponding rate distortion functions [see (7.6.62)]
$$R(D; Q^{(l)}) = \mathscr{H}\left(\frac{l}{N}\right) - \mathscr{H}(D) \qquad 0 \le D \le \min\left(\frac{l}{N},\ 1 - \frac{l}{N}\right) \tag{8.5.25}$$
Using the Chernoff bound (see Prob. 1.5) we have, for the number of sequences in $\mathscr{C}_N(l)$, denoted $|\mathscr{C}_N(l)|$,
$$|\mathscr{C}_N(l)| = \binom{N}{l} \le e^{N\mathscr{H}(l/N)} \tag{8.5.26}$$
This means we can always find a code of rate $R > \mathscr{H}(l/N)$ such that $M = e^{NR} > |\mathscr{C}_N(l)|$,
which can uniquely represent each sequence in $\mathscr{C}_N(l)$ and thus achieve zero distortion. We shall encode some composition classes with zero distortion and others
with some nonzero distortion.
Let us now pick $\delta$ such that $0 < \delta < \ln 2$, pick fixed rate $R$ in the interval
$0 < R < \ln 2$, and choose $0 < \epsilon < 0.3$ to satisfy
$$\mathscr{H}(\epsilon) < \delta \tag{8.5.27}$$
Observe that we can make $\epsilon$ and $\delta$ as small as we please and still satisfy (8.5.27).
Let the binary distribution $Q^*$ satisfying $Q^*(1) \le \frac{1}{2}$ be defined parametrically in
terms of the rate $R$, $\epsilon$, and $\delta$ as follows:
$$R = R(\epsilon; Q^*) + \delta = \mathscr{H}(Q^*(1)) - \mathscr{H}(\epsilon) + \delta \tag{8.5.28}$$
Also let $l^*$ be the largest integer such that $l^*/N \le Q^*(1) \le \frac{1}{2}$. Then from Fig. 8.3
we see that for any fixed composition class $\mathscr{C}_N(l)$ where either
$$\frac{l}{N} \le \frac{l^*}{N} \qquad\text{or}\qquad 1 - \frac{l}{N} \le \frac{l^*}{N} \tag{8.5.29}$$
we have
$$\mathscr{H}\left(\frac{l}{N}\right) = \mathscr{H}\left(1 - \frac{l}{N}\right) \le \mathscr{H}\left(\frac{l^*}{N}\right) \le \mathscr{H}(Q^*(1)) \tag{8.5.30}$$
and
$$|\mathscr{C}_N(l)| \le e^{N\mathscr{H}(l/N)} \le e^{N\mathscr{H}(Q^*(1))} \tag{8.5.31}$$
Thus for any composition class $\mathscr{C}_N(l)$ for which $\mathscr{H}(l/N) \le \mathscr{H}(Q^*(1))$, we can find a
block code of rate $R$ and block length $N$ such that
$$R = \mathscr{H}(Q^*(1)) - \mathscr{H}(\epsilon) + \delta > \mathscr{H}(Q^*(1)) \ge \mathscr{H}\left(\frac{l}{N}\right) \tag{8.5.32}$$
and from (8.5.26)
$$M = e^{NR} > e^{N\mathscr{H}(l/N)} \ge |\mathscr{C}_N(l)| \tag{8.5.33}$$
[Figure 8.3 Binary entropy relationships: the binary entropy function $\mathscr{H}(x)$ versus $x$, with the levels $\mathscr{H}(Q^*(1))$ and $R - \delta$ indicated.]
Therefore, since there are more representation sequences M than sequences in the
class, such a code can encode sequences from @,(/) with zero distortion where /
satisfies (8.5.29).
For any other fixed composition class $\mathscr{C}_N(l)$ for which instead
$$\frac{l^*}{N} < \frac{l}{N} \le 1 - \frac{l^*}{N} \tag{8.5.34}$$
define $D_l > \epsilon$ to satisfy
$$R = \mathscr{H}\left(\frac{l}{N}\right) - \mathscr{H}(D_l) + \delta \tag{8.5.35}$$
Such a $D_l$ can be found in the range $\epsilon < D_l \le l/N$. This is illustrated in Fig. 8.3. We
show next that, like our result for the symmetric source with balanced distortion
measure presented in Theorem 8.5.1, we can find a code of rate $R$ such that all
sequences in $\mathscr{C}_N(l)$ can be encoded with distortion $D_l$ or less. First we establish a
lemma analogous to Lemma 8.5.1 by considering an ensemble of block codes
$\mathscr{B} = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M\}$ of block length $N$ and rate $R = (\ln M)/N$ with probability
distribution
$$P(\mathscr{B}) = \prod_{m=1}^{M}P_N(\mathbf{v}_m) = \prod_{m=1}^{M}\prod_{n=1}^{N}P(v_{mn}) \tag{8.5.36}$$
where
$$P(v) = Q^{(l)}(0)P(v\,|\,0) + Q^{(l)}(1)P(v\,|\,1) = \left(1 - \frac{l}{N}\right)P(v\,|\,0) + \frac{l}{N}P(v\,|\,1) \tag{8.5.37}$$
and $P(v\,|\,u)$ is the conditional probability yielding the rate distortion function
$R(D_l; Q^{(l)}) = I(P)$.
Lemma 8.5.2 Let $\epsilon > 0$, $\delta > 0$, and rate $\delta < R < \ln 2$ satisfy (8.5.27) and
(8.5.28). For a fixed composition class $\mathscr{C}_N(l)$ satisfying (8.5.34), $D_l$ satisfying
(8.5.35), and any $\mathbf{u}\in\mathscr{C}_N(l)$, over the ensemble of block codes with probability
distribution (8.5.36)
$$\Pr\{d(\mathbf{u}\,|\,\mathscr{B}) > D_l\,|\,\mathbf{u}\in\mathscr{C}_N(l)\} \le \exp\left\{-\left(1 - \frac{4}{N\epsilon^2}\right)e^{N[\delta + \epsilon\ln(\epsilon/2)]}\right\} \tag{8.5.38}$$

Proof
$$\begin{aligned}
\Pr\{d(\mathbf{u}\,|\,\mathscr{B}) > D_l\,|\,\mathbf{u}\in\mathscr{C}_N(l)\} &= \Pr\{d_N(\mathbf{u}, \mathbf{v}_m) > D_l;\ m = 1, 2, \ldots, M\,|\,\mathbf{u}\in\mathscr{C}_N(l)\}\\
&= [\Pr\{d_N(\mathbf{u}, \mathbf{v}) > D_l\,|\,\mathbf{u}\in\mathscr{C}_N(l)\}]^M\\
&\le e^{-M\Pr\{d_N(\mathbf{u},\,\mathbf{v})\,\le\,D_l\,|\,\mathbf{u}\in\mathscr{C}_N(l)\}}
\end{aligned} \tag{8.5.39}$$
Here the key property we employ is that $\Pr\{d_N(\mathbf{u}, \mathbf{v}) \le D_l\,|\,\mathbf{u}\in\mathscr{C}_N(l)\}$ is
independent of $\mathbf{u}\in\mathscr{C}_N(l)$, since only the composition determines the probability distribution of $d_N(\mathbf{u}, \mathbf{v})$, which is a normalized sum of independent (though
not identically distributed) random variables. The generalized Chernoff
bounds in App. 8A again suffice for our purpose. Here we have
$$\Pr\{d_N(\mathbf{u}, \mathbf{v}) \le D_l\,|\,\mathbf{u}\in\mathscr{C}_N(l)\} \ge \left(1 - \frac{4}{N\epsilon^2}\right)e^{-N[R(D_l;\,Q^{(l)}) - \epsilon\ln(\epsilon/2)]} \tag{8.5.40}$$
Substituting (8.5.35) into (8.5.40) and the result into (8.5.39) then gives us the
desired result.
It is easy to see that, for $\epsilon < 0.3$, we have $\mathscr{H}(\epsilon) > -\epsilon\ln(\epsilon/2)$, so that
$\delta > -\epsilon\ln(\epsilon/2) > 0$ (see Prob. 8.12). Hence the exponent $[\delta + \epsilon\ln(\epsilon/2)] > 0$ in
(8.5.38). From this lemma follows the desired result.
Theorem 8.5.2 Let $\epsilon > 0$, $\delta > 0$ satisfy $\mathscr{H}(\epsilon) < \delta$. For sufficiently large integer
$N^*$, for any rate $R$ in the interval $\delta < R < \ln 2$, and any composition
class $\mathscr{C}_N(l)$ where $N \ge N^*$, there exists a code $\mathscr{B}_l$ of block length $N$ and rate $R$
such that
$$d(\mathbf{u}\,|\,\mathscr{B}_l) \le D_l \qquad \text{for all } \mathbf{u}\in\mathscr{C}_N(l) \tag{8.5.41}$$
where $D_l$ satisfies
$$R = \mathscr{H}\left(\frac{l}{N}\right) - \mathscr{H}(D_l) + \delta$$
when
$$\frac{l}{N}\in[Q^*(1),\ 1 - Q^*(1)]$$
and $D_l = 0$ otherwise. Here $Q^*(1) \le \frac{1}{2}$ satisfies
$$R = \mathscr{H}(Q^*(1)) - \mathscr{H}(\epsilon) + \delta$$
Proof For $l/N\notin[Q^*(1),\ 1 - Q^*(1)]$, $D_l = 0$ as a result of (8.5.33). Now for
any $l/N\in[Q^*(1),\ 1 - Q^*(1)]$, suppose we have a source that emits only sequences from $\mathscr{C}_N(l)$ with equal probabilities. For any block code $\mathscr{B}$ of block
length $N$ and rate $R$, define the indicator function
$$\Phi(\mathbf{u}\,|\,\mathscr{B}) = \begin{cases}1 & d(\mathbf{u}\,|\,\mathscr{B}) > D_l\\ 0 & d(\mathbf{u}\,|\,\mathscr{B}) \le D_l\end{cases} \qquad \text{for } \mathbf{u}\in\mathscr{C}_N(l) \tag{8.5.42}$$
Averaging $\Phi$ over output sequences, we obtain
$$\sum_{\mathbf{u}\in\mathscr{C}_N(l)}\frac{1}{|\mathscr{C}_N(l)|}\Phi(\mathbf{u}\,|\,\mathscr{B}) = \frac{1}{|\mathscr{C}_N(l)|}\sum_{\mathbf{u}\in\mathscr{C}_N(l)}\Phi(\mathbf{u}\,|\,\mathscr{B}) \tag{8.5.43}$$
Next consider an ensemble of block codes where code $\mathscr{B} = \{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_M\}$ is
chosen according to the probability distribution (8.5.36) and (8.5.37). Averaging (8.5.43) over this code ensemble yields
$$\begin{aligned}
\frac{1}{|\mathscr{C}_N(l)|}\sum_{\mathbf{u}\in\mathscr{C}_N(l)}\overline{\Phi(\mathbf{u}\,|\,\mathscr{B})} &= \frac{1}{|\mathscr{C}_N(l)|}\sum_{\mathbf{u}\in\mathscr{C}_N(l)}\sum_{\mathscr{B}}P(\mathscr{B})\Phi(\mathbf{u}\,|\,\mathscr{B})\\
&= \frac{1}{|\mathscr{C}_N(l)|}\sum_{\mathbf{u}\in\mathscr{C}_N(l)}\Pr\{d(\mathbf{u}\,|\,\mathscr{B}) > D_l\,|\,\mathbf{u}\in\mathscr{C}_N(l)\}\\
&\le \exp\left\{-\left(1 - \frac{4}{N\epsilon^2}\right)e^{N[\delta + \epsilon\ln(\epsilon/2)]}\right\}
\end{aligned} \tag{8.5.44}$$
where the inequality follows from Lemma 8.5.2. Using the bound
$|\mathscr{C}_N(l)| \le 2^N$, it follows that there exists a code $\mathscr{B}_l$ of block length $N$ and rate
$R$ such that
$$\sum_{\mathbf{u}\in\mathscr{C}_N(l)}\Phi(\mathbf{u}\,|\,\mathscr{B}_l) \le \sum_{\mathbf{u}\in\mathscr{C}_N(l)}\overline{\Phi(\mathbf{u}\,|\,\mathscr{B})} \le 2^N\exp\left\{-\left(1 - \frac{4}{N\epsilon^2}\right)e^{N[\delta + \epsilon\ln(\epsilon/2)]}\right\} \tag{8.5.45}$$
Choosing $N^*$ to be any integer for which the bound is less than one, it follows
as in the proof of Theorem 8.5.1 that $\Phi(\mathbf{u}\,|\,\mathscr{B}_l) = 0$ for all $\mathbf{u}\in\mathscr{C}_N(l)$.
This theorem shows that, given any $0 < \delta < \ln 2$, rate $R$ such that $\delta < R <
\ln 2$, and $0 < \epsilon < 0.3$ satisfying $\mathscr{H}(\epsilon) < \delta$, for any composition class $\mathscr{C}_N(l)$ where
$N \ge N^*$, we can find a block code $\mathscr{B}_l$ of block length $N$ and rate $R$ such that
$d(\mathbf{u}\,|\,\mathscr{B}_l) = 0$ for all $\mathbf{u}\in\mathscr{C}_N(l)$ if $\mathscr{H}(l/N) \le \mathscr{H}(Q^*(1))$, and $d(\mathbf{u}\,|\,\mathscr{B}_l) \le D_l$ for all
$\mathbf{u}\in\mathscr{C}_N(l)$ if $\mathscr{H}(l/N) > \mathscr{H}(Q^*(1))$, where $Q^*$ satisfies (8.5.28) and $D_l > \epsilon$ satisfies
(8.5.35) (see also Fig. 8.3).
It is natural to define the composite code
$$\mathscr{B}_C = \bigcup_{l=0}^{N}\mathscr{B}_l \tag{8.5.46}$$
which has $(N + 1)e^{NR}$ elements ($e^{NR}$ for each of the $N + 1$ composition classes) and
hence rate
$$R_C = R + \frac{\ln(N + 1)}{N} \tag{8.5.47}$$
For the code $\mathscr{B}_C$, we have
$$d(\mathbf{u}\,|\,\mathscr{B}_C) \le D_l \qquad \text{if } \mathbf{u}\in\mathscr{C}_N(l) \tag{8.5.48}$$
where we take $D_l = 0$ if $\mathscr{H}(l/N) \le \mathscr{H}(Q^*(1))$. We see that, as $N\to\infty$, $R_C\to R$, and
thus by choosing $N$ large enough we can make the rate of the composite code $\mathscr{B}_C$
arbitrarily close to $R$.
Up to this point, the results depend only on the source alphabet and are
independent of the source statistics. The composite code Be satisfies (8.5.48)
regardless of the actual source statistics. Suppose, however, that our binary source
is memoryless with probability $Q(1) = q \le \frac{1}{2}$ and $Q(0) = 1 - q$. Then the rate
distortion function for this source is $R(D) = \mathscr{H}(q) - \mathscr{H}(D)$ for $0 \le D \le q$. How
well does the composite code encode this source? The average distortion using the
composite code is
$$\begin{aligned}
\bar{d}(\mathscr{B}_C) &= \sum_{\mathbf{u}}Q_N(\mathbf{u})\,d(\mathbf{u}\,|\,\mathscr{B}_C)\\
&= \sum_{l=0}^{N}\sum_{\mathbf{u}\in\mathscr{C}_N(l)}Q_N(\mathbf{u})\,d(\mathbf{u}\,|\,\mathscr{B}_C)\\
&\le \sum_{l=0}^{N}\sum_{\mathbf{u}\in\mathscr{C}_N(l)}q^l(1 - q)^{N-l}D_l\\
&= \sum_{l=0}^{N}\binom{N}{l}q^l(1 - q)^{N-l}D_l
\end{aligned} \tag{8.5.49}$$
As $N$ increases, $\binom{N}{l}q^l(1 - q)^{N-l}$ concentrates its mass around its mean $Nq$. This
follows from the asymptotic equipartition property (McMillan [1953]) which says
that, as block length increases, almost all source sequences tend to have the same
composition. Thus we have (see Prob. 8.13 and Chap. 1)
$$\lim_{N\to\infty}\sum_{l=0}^{N}\binom{N}{l}q^l(1 - q)^{N-l}D_l = D \tag{8.5.50}$$
where $D$ satisfies (8.5.35) with $l = Nq$; that is,
$$R = R(D; q) + \delta = \mathscr{H}(q) - \mathscr{H}(D) + \delta = R(D) + \delta \tag{8.5.51}$$
The code rate for $\mathscr{B}_C$ then becomes
$$R_C = R(D) + \delta + \frac{\ln(N + 1)}{N} \tag{8.5.52}$$
Hence given any $\eta > 0$, we can find $\delta$ small enough and $N$ large enough so that
$$\bar{d}(\mathscr{B}_C) \le D + \eta \tag{8.5.53}$$
and
$$R_C \le R(D) + \eta \tag{8.5.54}$$
Thus the composite codes can encode any memoryless binary source with error
distortion arbitrarily close to the theoretical rate distortion limit. This is a robust
source encoding scheme for memoryless sources in the sense that the same compo-
sition class code is efficient (near the rate distortion limit) for all such sources and
the composite code is constructed independent of actual source statistics.
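The performance of the composite code can be evaluated numerically without constructing any codes. The Python sketch below computes, for assumed values of $N$, $R$, $\delta$, $\epsilon$, and $q$, the distribution $Q^*(1)$ of (8.5.28), the per-class distortions $D_l$ of (8.5.35), the binomially weighted average distortion of (8.5.49), and the composite rate of (8.5.52), and compares them with the rate distortion function $R(D) = \mathscr{H}(q) - \mathscr{H}(D)$. All parameter values are assumptions chosen only for the example.

```python
import math

H = lambda x: 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def H_inv(y):
    """Inverse of the binary entropy function (nats) on [0, 1/2], by bisection."""
    lo, hi = 0.0, 0.5
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if H(mid) < y else (lo, mid)
    return 0.5 * (lo + hi)

# Assumed parameters (they must satisfy H(eps) < delta and delta < R < ln 2).
N, R, delta, eps, q = 400, 0.45, 0.05, 0.005, 0.2
Qstar = H_inv(R + H(eps) - delta)            # from (8.5.28)

def D_class(l):
    """Per-class distortion D_l: zero if the class is covered by (8.5.33), else (8.5.35)."""
    p = l / N
    if H(p) <= H(Qstar):
        return 0.0
    return H_inv(H(p) - R + delta)

# Binomially weighted average distortion (8.5.49) and composite rate (8.5.52).
logpmf = lambda l: (math.lgamma(N + 1) - math.lgamma(l + 1) - math.lgamma(N - l + 1)
                    + l * math.log(q) + (N - l) * math.log(1 - q))
d_bar = sum(math.exp(logpmf(l)) * D_class(l) for l in range(N + 1))
R_C = R + math.log(N + 1) / N
D_limit = H_inv(H(q) - R)                    # distortion at the R(D) limit for rate R
print(f"composite code: rate R_C = {R_C:.4f} nats, average distortion = {d_bar:.4f}")
print(f"rate distortion limit at rate R = {R:.2f}: D = {D_limit:.4f}")
```

Shrinking $\delta$ and $\epsilon$ while increasing $N$ drives both the average distortion and the composite rate toward the rate distortion limit, which is the robustness claim of (8.5.53) and (8.5.54).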
The preceding example of a binary alphabet with the error distortion measure
can be generalized to arbitrary discrete alphabets and arbitrary single-letter dis-
tortions (see Prob. 8.14). Further generalizations are possible by considering fixed
finite sequences of source outputs as elements of a larger extended discrete
alphabet. In this manner, the robust source coding technique can be applied to
sources with memory (see Martin [1976]). The basic approach of considering a
single source as a composite of subsources and finding codes for each subsource in
constructing a total composite code is also used in encoding nonergodic station-
ary sources. This is referred to as universal source coding and is discussed in
Sec. 8.6.
We have demonstrated a similarity between symmetric sources with balanced
distortions and fixed composition classes. In general with any discrete alphabet,
for any fixed composition class, we may define a function R(D; Q) where Q is the
distribution determined by the composition. We can show that if R > R(D; Q)
and the block length is large enough, we can find a code that will encode all
sequences of the composition class to distortion D or less. Certainly if
$$R > \max_{Q}R(D; Q) \tag{8.5.55}$$
then every output sequence can be encoded with distortion $D$ or less. Symmetric
sources with balanced distortions have the property that
$$R(D) = \max_{Q}R(D; Q) \tag{8.5.56}$$
Thus the symmetric source coding theorem (Theorem 8.5.1) is actually a special
case of the composition class source coding theorem (Theorem 8.5.2 appropriately
generalized to arbitrary discrete alphabets and any single-letter distortion meas-
ures). See Probs. 8.14 and 8.15 for generalizations and further details.
8.5.3 Example of Encoding with Linear Block Codes
We conclude our discussion with a coding example for the simplest symmetric
source with balanced distortion, the binary symmetric source with error distortion
measure. This example, due to Goblick [1962], shows that Theorem 8.5.1 is
satisfied with a linear binary code.
Let $\mathcal{U} = \mathcal{V} = \{0, 1\}$, $Q(0) = Q(1) = \frac{1}{2}$, and $d(k, j) = 1 - \delta_{kj}$. The rate distortion function is, of course, $R(D) = \ln 2 - \mathscr{H}(D)$ for $0 \le D \le \frac{1}{2}$. Now we consider
linear binary $(N, K)$ codes for source coding where the rate is $r = K/N$ bits per
symbol or $R = (K/N)\ln 2$ nats per symbol. First consider $K$ binary sequences of
length $N$, $\{\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_K\}$, which we call code-generator vectors. With these generator vectors we generate a sequence of codes of block length $N$ and different rates
by defining, for $l = 1, 2, \ldots, K$, the subcodes
$$\mathscr{B}(l) = \{\mathbf{v}\colon \mathbf{v} = c_1\mathbf{b}_1 \oplus c_2\mathbf{b}_2 \oplus\cdots\oplus c_l\mathbf{b}_l\} \tag{8.5.57}$$
where the binary coefficients $c_1, c_2, \ldots, c_l$ are all possible binary sequences of
length $l$. There are then $2^l$ codewords in $\mathscr{B}(l)$. By defining the set
$$\mathscr{B}(l;\mathbf{b}_{l+1}) = \{\mathbf{v}'\colon \mathbf{v}' = \mathbf{v}\oplus\mathbf{b}_{l+1},\ \mathbf{v}\in\mathscr{B}(l)\} \tag{8.5.58}$$
we see that
$$\mathscr{B}(l + 1) = \mathscr{B}(l)\cup\mathscr{B}(l;\mathbf{b}_{l+1}) \tag{8.5.59}$$
That is, code $\mathscr{B}(l + 1)$, which has rate $(l + 1)/N$ bits per symbol, is the union of
code $\mathscr{B}(l)$, which has rate $l/N$ bits per symbol, and a "shifted" version of this code
denoted $\mathscr{B}(l;\mathbf{b}_{l+1})$.
Generate the ensemble of linear binary codes obtained by randomly selecting
the generator vectors such that all components of all vectors are treated as
independent equiprobable binary random variables. Since there are $Nl$ components in the generator vectors $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_l$, the code $\mathscr{B}(l)$ has ensemble probability
distribution given by
$$P(\mathscr{B}(l)) = \left(\frac{1}{2}\right)^{Nl} \tag{8.5.60}$$
Recall that $\mathbf{u}\in\mathcal{U}_N$ also has a uniform probability distribution, so that over the
source and generator ensembles $\mathbf{u}$ and $\mathbf{u}\oplus\mathbf{b}$ are independent binary vectors.
(Check this for $N = 1$ and generalize.)
The usual ensemble coding argument must be modified here to a series of
average coding arguments and a sequential selection of codeword generators.
Since code $\mathscr{B}(l + 1)$ is constructed from code $\mathscr{B}(l)$ and another randomly selected
generator vector $\mathbf{b}_{l+1}$, we have
$$\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l + 1)) > D\,|\,\mathscr{B}(l)\} = \Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l)) > D,\ d(\mathbf{u}\,|\,\mathscr{B}(l;\mathbf{b}_{l+1})) > D\,|\,\mathscr{B}(l)\} \tag{8.5.61}$$
where the probability is over the ensemble of $\mathbf{u}\in\mathcal{U}_N$ and $\mathbf{b}_{l+1}\in\mathcal{U}_N$. But now
$$d(\mathbf{u}\,|\,\mathscr{B}(l;\mathbf{b}_{l+1})) = \min_{\mathbf{v}\in\mathscr{B}(l)}d_N(\mathbf{u}, \mathbf{v}\oplus\mathbf{b}_{l+1}) = \min_{\mathbf{v}\in\mathscr{B}(l)}d_N(\mathbf{u}\oplus\mathbf{b}_{l+1}, \mathbf{v}) = d(\mathbf{u}\oplus\mathbf{b}_{l+1}\,|\,\mathscr{B}(l)) \tag{8.5.62}$$
and, since $\mathbf{u}$ and $\mathbf{u}\oplus\mathbf{b}_{l+1}$ are independent of each other, (8.5.61) becomes
$$\begin{aligned}
\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l + 1)) > D\,|\,\mathscr{B}(l)\} &= \Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l)) > D\,|\,\mathscr{B}(l)\}\Pr\{d(\mathbf{u}\oplus\mathbf{b}_{l+1}\,|\,\mathscr{B}(l)) > D\,|\,\mathscr{B}(l)\}\\
&= [\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l)) > D\,|\,\mathscr{B}(l)\}]^2
\end{aligned} \tag{8.5.63}$$
The left side of (8.5.63) can also be written as an average over $\mathbf{b}_{l+1}$
$$\begin{aligned}
\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l + 1)) > D\,|\,\mathscr{B}(l)\} &= \sum_{\mathbf{b}_{l+1}}P(\mathbf{b}_{l+1})\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l + 1)) > D\,|\,\mathscr{B}(l),\mathbf{b}_{l+1}\}\\
&= \sum_{\mathbf{b}_{l+1}}P(\mathbf{b}_{l+1})\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l + 1)) > D\,|\,\mathscr{B}(l + 1)\}
\end{aligned} \tag{8.5.64}$$
Hence given any code $\mathscr{B}(l)$, there exists a generator vector $\mathbf{b}_{l+1}$ such that
$$\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l + 1)) > D\,|\,\mathscr{B}(l + 1)\} \le \Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l + 1)) > D\,|\,\mathscr{B}(l)\} = [\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(l)) > D\,|\,\mathscr{B}(l)\}]^2 \tag{8.5.65}$$
We can select a sequence of generator vectors $\mathbf{b}_1, \mathbf{b}_2, \ldots, \mathbf{b}_K$ such that for each $l$,
(8.5.65) holds. Then for such a set of $K$ generator vectors we have
$$\begin{aligned}
\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(K)) > D\,|\,\mathscr{B}(K)\} &\le [\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(0)) > D\,|\,\mathscr{B}(0)\}]^{2^K}\\
&= [\Pr\{d_N(\mathbf{u}, \mathbf{0}) > D\}]^{2^K}\\
&= [1 - F_N(D)]^{2^K}\\
&\le e^{-2^K F_N(D)}
\end{aligned} \tag{8.5.66}$$
where we have used $\ln x \le x - 1$ and defined $F_N(D) = \Pr\{d_N(\mathbf{u}, \mathbf{0}) \le D\}$. From
App. 8A, we have
$$F_N(D) \ge e^{-N[R(D) + o(N)]}$$
so that there exists a code $\mathscr{B}(K)$ such that
$$\Pr\{d(\mathbf{u}\,|\,\mathscr{B}(K)) > D\,|\,\mathscr{B}(K)\} \le \exp\{-e^{N[R - R(D) + o(N)]}\} \tag{8.5.67}$$
where $R = (K/N)\ln 2$. Following the same argument as in the proof of Theorem
8.5.1, we see that by choosing $N$ sufficiently large, for any fixed rate
$$R = (K/N)\ln 2 > R(D) = \ln 2 - \mathscr{H}(D)$$
there exists a linear binary $(N, K)$ code $\mathscr{B}(K)$ such that
$$d(\mathbf{u}\,|\,\mathscr{B}(K)) \le D \qquad \text{for all } \mathbf{u}\in\mathcal{U}_N \tag{8.5.68}$$
Thus for a binary symmetric source and error distortion measure, a uniform
distortion condition is met by a linear code.
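As a small experiment along these lines, the Python sketch below builds a linear binary code from randomly chosen generator vectors, as in (8.5.57), and measures the worst-case and average distortion $d(\mathbf{u}\,|\,\mathscr{B}(K))$ over all $2^N$ source sequences. The random selection replaces the sequential generator-selection argument of the text, and the small values of $N$ and $K$ are assumptions chosen only to keep the exhaustive search cheap; at such short block lengths the measured distortions remain well above the asymptotic $D$ satisfying $R = \ln 2 - \mathscr{H}(D)$, which the theorem guarantees only for large $N$.

```python
import itertools, math, random

N, K = 12, 8                       # assumed small (N, K) so exhaustive search is cheap
R = (K / N) * math.log(2.0)

random.seed(1)
gens = [random.getrandbits(N) for _ in range(K)]          # generator vectors b_1..b_K

# All 2^K codewords v = c_1 b_1 XOR ... XOR c_K b_K, as in (8.5.57).
codewords = []
for c in itertools.product((0, 1), repeat=K):
    v = 0
    for bit, g in zip(c, gens):
        if bit:
            v ^= g
    codewords.append(v)

def d_min(u):                      # d(u | B(K)) = minimum Hamming distortion over the code
    return min(bin(u ^ v).count("1") for v in codewords) / N

worst = max(d_min(u) for u in range(1 << N))
avg = sum(d_min(u) for u in range(1 << N)) / (1 << N)

# D from R = ln 2 - H(D), by bisection on [0, 1/2].
H = lambda x: 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)
lo, hi = 0.0, 0.5
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (lo, mid) if math.log(2.0) - H(mid) < R else (mid, hi)
D = 0.5 * (lo + hi)
print(f"R = {R:.3f} nats   worst-case d = {worst:.3f}   average d = {avg:.3f}   D(R) = {D:.3f}")
```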
8.6 UNIVERSAL CODING
The source coding theorems of Sec. 8.2 were restricted to stationary ergodic
sources. The formal definition of R(D) given by (8.2.5), however, can also be given
for nonergodic stationary sources where Lemma 8.2.1 still applies. The converse
coding theorem (Theorem 8.2.1) applies to nonergodic stationary sources only if
we interpret average distortion as an ensemble average. The coding theorems,
however, do require that the sources be ergodic. One might expect that it would be
possible to prove coding theorems for arbitrary stationary sources and then show
that R(D) is indeed the minimum possible rate that can be achieved with ensemble
average distortion of D or less. We present next, however, a counterexample
which shows that R(D) given by (8.2.5) does not represent the minimum possible
rate necessary to achieve ensemble average distortion D for nonergodic stationary
sources.
Example (Gray [1975]) Consider a memoryless Gaussian source of zero mean and variance $\sigma^2$.
For the squared-error distortion measure $d(u, v) = (u - v)^2$, the rate distortion function is given
by (7.7.20)
$$R(D) = \frac{1}{2}\ln\frac{\sigma^2}{D}\quad\text{nats/symbol} \qquad 0\le D\le\sigma^2 \tag{7.7.20}$$
Next suppose we have any stationary source whose outputs are random variables (not necessarily
independent) with zero mean and variance $\sigma^2$. For the squared-error distortion, we can define the
function $R(D)$ as given by (8.2.5). Lemma 8.2.1 shows that
$$R(D) \le R_1(D) \tag{8.6.1}$$
where $R_1(D)$ is the rate distortion function for the corresponding memoryless source. From
Theorem 7.7.3, we have the inequality
$$R_1(D) \le \frac{1}{2}\ln\frac{\sigma^2}{D} = R_G(D) \tag{8.6.2}$$
with equality if and only if the source single-letter probability density is Gaussian. Hence, for any
rate $R$, if we pick $D_1$ and $D_2$ to satisfy
$$R = R_G(D_1) = R_1(D_2) \tag{8.6.3}$$
[Figure 8.4 Composite source: a switch selects, each with probability $\frac{1}{2}$, the output sequence $\mathbf{u}\in R_N$ of one of two zero-mean memoryless Gaussian subsources, with densities $Q_N^{(1)}(\mathbf{u})$ (variance $\sigma_1^2$) and $Q_N^{(2)}(\mathbf{u})$ (variance $\sigma_2^2$).]
then we have
$$D_1 = \sigma^2 e^{-2R} \ge D_2 \tag{8.6.4}$$
with equality if and only if the memoryless source is Gaussian.
Now consider a composite source consisting of two memoryless Gaussian subsources, each
of zero mean. One subsource has variance $\sigma_1^2$ and the other subsource has variance $\sigma_2^2 \ne \sigma_1^2$. The
composite source has the output sequence of the first subsource with probability $\frac{1}{2}$ and the output
sequence of the second subsource with probability $\frac{1}{2}$. This source is sketched in Fig. 8.4. Hence
$\mathbf{u}\in R_N$ has probability density
$$Q_N(\mathbf{u}) = \frac{1}{2(2\pi\sigma_1^2)^{N/2}}e^{-\|\mathbf{u}\|^2/2\sigma_1^2} + \frac{1}{2(2\pi\sigma_2^2)^{N/2}}e^{-\|\mathbf{u}\|^2/2\sigma_2^2} = \frac{1}{2}Q_N^{(1)}(\mathbf{u}) + \frac{1}{2}Q_N^{(2)}(\mathbf{u}) \tag{8.6.5}$$
which is clearly non-Gaussian when $\sigma_1^2 \ne \sigma_2^2$. The composite source has memory and is stationary. It is not ergodic.¹³ Its first-order density is
$$Q(u) = \frac{1}{2}Q^{(1)}(u) + \frac{1}{2}Q^{(2)}(u) \tag{8.6.6}$$
where
$$\int_{-\infty}^{\infty}uQ(u)\,du = 0$$
and
$$\int_{-\infty}^{\infty}u^2Q(u)\,du = \frac{1}{2}\int_{-\infty}^{\infty}u^2Q^{(1)}(u)\,du + \frac{1}{2}\int_{-\infty}^{\infty}u^2Q^{(2)}(u)\,du = \frac{1}{2}\sigma_1^2 + \frac{1}{2}\sigma_2^2 = \sigma^2 \tag{8.6.7}$$
For the distortion $d(u, v) = (u - v)^2$, we can define $R(D)$ and $R_1(D)$. For any rate $R$, we have from
(8.6.4) that $D_1$, where $R = R_G(D_1)$, satisfies
$$D_1 = \sigma^2 e^{-2R} = D_2 + \delta \tag{8.6.8}$$
where $\delta > 0$, since (8.6.6) is not a Gaussian density function.
¹³ The variance of any sample output sequence is either $\sigma_1^2$ or $\sigma_2^2$, while the ensemble variance is
$\sigma^2 = \frac{1}{2}\sigma_1^2 + \frac{1}{2}\sigma_2^2$.
Let # be any block code of block length N and rate R. If this code is used to encode the
composite source, the ensemble average distortion is
io.) ic.2)
aa)=| | Qy(u) d(u| A) du
<5 Jf Ow) dtu)ea) ans ff OP (a) dul) ae
ee d,(f)+ 4 d,(Z) (8.6.9)
But
d,(Z) = LE es ie Q\(u) d(u|#) du (8.6.10)
is the average distortion for the zero-mean Gaussian source with variance a7. The converse
coding theorem states that
d,(Z) = aje 7" (8.6.11)
and similarly
d,(B) = aje 7% (8.6.12)
yielding
= io} + 03)e"7*
=D,+6 (8.6.13)
where 6 > 0, according to (8.6.8).
If R(D) represents the achievable rate for which we can encode the stationary composite
source with ensemble average distortion D or less, then given any ε > 0 we can find a block code
of rate¹⁴ R = R(D) such that

    d(𝓑) ≤ D + ε     (8.6.14)

But from (8.6.1) and (8.6.3) we have that R = R(D) = R₁(D₂) ≤ R₁(D), which implies

    D ≤ D₂     (8.6.15)

and so

    d(𝓑) ≤ D₂ + ε     (8.6.16)

However, from (8.6.13),

    D₂ + δ ≤ d(𝓑) ≤ D₂ + ε     (8.6.17)

which is a contradiction since we can choose ε < δ. Hence R(D) does not represent minimum
achievable rates for the stationary composite source.

¹⁴ In Corollary 7.2.2 we can replace R(D) + ε by R(D) (see Prob. 7.4).
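The size of the gap δ in (8.6.8) can be examined numerically. The short sketch below is not from the text; the variances, grid, and slope values are arbitrary choices. It uses the Blahut-Arimoto algorithm on a discretized version of the mixture marginal (8.6.6) to estimate its single-letter rate distortion function R₁(D), and compares it with the Gaussian value R_G(D) = ½ ln (σ²/D) at the same distortion; the positive gap is exactly what forces D₂ < D₁ above.

```python
# A minimal numerical sketch (not from the text) illustrating why delta > 0 in (8.6.8):
# the single-letter rate distortion function R1(D) of the non-Gaussian mixture marginal
# Q(u) = (1/2)Q^(1)(u) + (1/2)Q^(2)(u) lies below the Gaussian bound
# R_G(D) = (1/2) ln(sigma^2 / D) of the same variance sigma^2 = (sigma1^2 + sigma2^2)/2.
# R1(D) is estimated with the Blahut-Arimoto algorithm on a discretized grid, so the
# numbers are approximations; grid, variances, and beta values are arbitrary choices.
import numpy as np

sig1, sig2 = 1.0, 2.0
sigma2 = 0.5 * sig1**2 + 0.5 * sig2**2          # ensemble variance

x = np.linspace(-8, 8, 241)                     # source / reproduction grid
p = 0.5 * np.exp(-x**2 / (2 * sig1**2)) / sig1 + 0.5 * np.exp(-x**2 / (2 * sig2**2)) / sig2
p /= p.sum()                                    # discretized mixture marginal
d = (x[:, None] - x[None, :])**2                # squared-error distortion matrix

def blahut_arimoto(beta, iters=500):
    """Return (D, R) in nats for slope parameter beta > 0."""
    q = np.full(len(x), 1.0 / len(x))           # output distribution
    for _ in range(iters):
        A = q[None, :] * np.exp(-beta * d)      # unnormalized test channel
        Q = A / A.sum(axis=1, keepdims=True)    # Q(v|u)
        q = p @ Q                               # updated output distribution
    D = float(np.sum(p[:, None] * Q * d))
    R = float(np.sum(p[:, None] * Q * np.log(np.maximum(Q, 1e-300) / np.maximum(q[None, :], 1e-300))))
    return D, R

for beta in (0.5, 1.0, 2.0, 4.0):
    D, R1 = blahut_arimoto(beta)
    RG = 0.5 * np.log(sigma2 / D)               # Gaussian rate at the same distortion
    print(f"D = {D:.3f}   R1(D) = {R1:.3f}   R_G(D) = {RG:.3f}   gap = {RG - R1:.3f}")
```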
The above counterexample shows us that the function R(D), although
definable for arbitrary stationary sources, has operational significance only for
stationary ergodic sources. It turns out, however, that a stationary source in
general can always be viewed as a union of stationary ergodic subsources (Gray
and Davisson [1974]). (In the above counterexample the source consisted of two
subsources.) This fact has led to the development of coding theorems for general
stationary sources without the ergodicity assumption. We illustrate this generali-
zation to nonergodic stationary sources with a simple example.
Example (Stationary binary source) Suppose we have a binary source which consists of L memoryless
binary subsources as shown in Fig. 8.5, where the lth subsource 𝒮_l outputs independent
binary symbols with probability p_l of a "1" output at any given time, for l = 1, 2, ..., L. The
composite binary source has as its output sequence the output sequence of one of its subsources.
It has a priori probability π_l (l = 1, 2, ..., L) of being connected to subsource 𝒮_l for all time.
Hence u = (u_{t+1}, u_{t+2}, ..., u_{t+N}) has probability

    P_N(u) = Σ_{l=1}^{L} P_N(u|𝒮_l) π_l
           = Σ_{l=1}^{L} π_l p_l^{w(u)} (1 − p_l)^{N−w(u)}     (8.6.18)

where w(u) is the number of "1"s in u. Clearly this binary source is a stationary source. It is not
ergodic since any sample output sequence (..., u_{−1}, u_0, u_1, ...) has time average

    lim_{N→∞} (1/N) Σ_{n=1}^{N} u_n = p_l     (8.6.19)

if it is the output of subsource 𝒮_l, whereas the ensemble expectation is

    E{u_n} = Σ_{l=1}^{L} π_l p_l     (8.6.20)
[Figure 8.5 Stationary nonergodic binary source: L memoryless binary subsources 𝒮_1, ..., 𝒮_L with parameters p_1, ..., p_L and a priori probabilities π_1, ..., π_L; a switch connects one subsource to the output sequence (..., u_1, u_2, u_3, ...) for all time.]
Suppose we have the representation alphabet 𝒱 = 𝒰 = {0, 1} and error distortion measure
d(u, v) = 1 − δ_{uv}. What is the smallest average distortion we can achieve for this binary source?
Although R(D) can be defined in terms of (8.2.5), the previous example showed that R(D) does not
necessarily represent the minimum rate that can achieve average distortion D. We do know that,
given ε > 0, there exist block codes 𝓑¹, 𝓑², ..., 𝓑^L of block length N and rate R such that the
first subsource can be encoded using code 𝓑¹ with average distortion

    d¹(𝓑¹) ≤ D¹ + ε     (8.6.21)

where D¹ satisfies

    R = R(D¹; p₁)
      = ℋ(p₁) − ℋ(D¹)     (8.6.22)

Similarly, the lth subsource can be encoded using code 𝓑^l with average distortion

    d^l(𝓑^l) ≤ D^l + ε     (8.6.23)

where D^l satisfies

    R = R(D^l; p_l)
      = ℋ(p_l) − ℋ(D^l)     (8.6.24)

In other words, for a given rate R and any ε > 0 we can find for each subsource a block code
which will give average distortion within ε of the smallest average distortion possible for that
subsource. The converse theorem applied to the lth subsource says we cannot do any better than
average distortion D^l.
Suppose we construct a code for our nonergodic stationary binary source as the union of the
above codes designed for each subsource and denote this composite code

    𝓑_c = ⋃_{l=1}^{L} 𝓑^l     (8.6.25)

This code has

    M_c = L e^{NR}     (8.6.26)

codewords, since there are e^{NR} codewords in each of the subcodes 𝓑¹, 𝓑², ..., 𝓑^L. The rate of the
composite code is thus

    R_c = (ln M_c)/N
        = R + (ln L)/N     (8.6.27)

where, as N approaches infinity, (ln L)/N converges to zero. For any source sequence of length N,
u = (u₁, u₂, ..., u_N), this code has distortion

    d(u|𝓑_c) = min_{v ∈ 𝓑_c} d_N(u, v)
             = min {min_{v ∈ 𝓑¹} d_N(u, v), min_{v ∈ 𝓑²} d_N(u, v), ..., min_{v ∈ 𝓑^L} d_N(u, v)}
             = min {d(u|𝓑¹), d(u|𝓑²), ..., d(u|𝓑^L)}     (8.6.28)

Hence the average distortion using code 𝓑_c is at least as small as is achievable with the knowledge
of which subsource is connected to the output and using the appropriate subcode. That is,

    d(𝓑_c) ≤ D^l + ε     (8.6.29)
if subsource 𝒮_l is connected to the output. Hence, for a fixed code rate and by choosing large
enough block lengths, the code 𝓑_c can have average distortion arbitrarily close to the minimum
possible average distortion.
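As a concrete illustration (not from the text; the subsource parameters p_l, the rate R, and the block lengths below are arbitrary choices), the following sketch solves (8.6.24), R = ℋ(p_l) − ℋ(D^l), for each per-subsource distortion target D^l by bisection, and lists the composite-code rate R_c = R + (ln L)/N of (8.6.27) for a few block lengths to show how quickly the (ln L)/N penalty vanishes.

```python
# A small numerical sketch (not from the text) of the composite-code construction for the
# stationary binary source: for each subsource parameter p_l we solve (8.6.24),
# R = H(p_l) - H(D_l), for the per-subsource distortion target D_l, and we list the
# composite-code rate R_c = R + (ln L)/N of (8.6.27) for increasing block lengths N.
# The parameters p_l, the rate R, and the block lengths are arbitrary choices.
import math

def H(p):
    """Binary entropy function in nats."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log(p) - (1 - p) * math.log(1 - p)

def distortion_target(p, R):
    """Solve H(p) - H(D) = R for D in (0, p] by bisection (assumes p <= 1/2 and R <= H(p))."""
    lo, hi = 1e-12, p
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if H(p) - H(mid) > R:
            lo = mid          # distortion still too small for this rate
        else:
            hi = mid
    return 0.5 * (lo + hi)

p_list = [0.1, 0.2, 0.35, 0.5]        # subsource parameters p_1, ..., p_L (all <= 1/2)
R = 0.2                               # common subcode rate, nats/source symbol
L = len(p_list)

for l, p in enumerate(p_list, start=1):
    print(f"subsource {l}: p = {p:.2f}  ->  D^{l} = {distortion_target(p, R):.4f}")

for N in (100, 1000, 10000):
    print(f"N = {N:6d}:  composite rate R_c = R + (ln L)/N = {R + math.log(L)/N:.5f}")
```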
In Sec. 8.2, we established the performance of the best possible encoding
methods for stationary ergodic sources. Generalizing on the above example, we
may show that when a source can be modeled as a finite collection of stationary
ergodic subsources, then by using good codes for each of the subsources to form a
composite code for the overall stationary, but not necessarily ergodic, source, we
can still achieve the minimum average distortion possible for a fixed rate. This
technique generalizes to a large class of nonergodic stationary sources, because
nonergodic stationary sources can generally be represented as a collection of
stationary ergodic subsources, characterized by an a priori probability distribu-
tion that any particular subsource output sequence is the total source output
sequence. Although for many sources the number of subsources thus required is
infinite, under certain topological conditions (on both the source and the distor-
tion measure) the collection of subsources can be approximated by a finite collec-
tion of subsources. Once the finite approximation is made, we can proceed as in
the above example. To illustrate this approach, we return to the binary example,
but now with an uncountable number of stationary ergodic subsources.
Example (Binary source with a random parameter) Consider a memoryless binary source where
the probability p of a “1” is a random variable with range between 0 and 1. We wish to encode
this source using the error distortion measure d(u, v) = 1 − δ_{uv}. If p ∈ [0, 1] were known we
would have a memoryless binary source which is stationary and ergodic. Because of the random
parameter p, the overall source is stationary but nonergodic. In order to reduce this problem to
the case of our previous example, we need to approximate the set of all possible subsources by a
finite set of subsources. To do this, we define a distance between two binary memoryless sources,
each with known but different parameters.
Let 𝒮 and 𝒮̂ be two binary memoryless sources with parameters p and p̂ respectively. Let
Q(u, û) be any joint distribution such that

    p = Σ_û Q(1, û) = Q(1, 0) + Q(1, 1)     (8.6.30)

and

    p̂ = Σ_u Q(u, 1) = Q(0, 1) + Q(1, 1)     (8.6.31)

That is, let Q(u, û) be any joint distribution with marginal distributions corresponding to sources
𝒮 and 𝒮̂. Define the distance between the two sources as

    d(p, p̂) = min_{Q ∈ 𝒬} Σ_u Σ_û Q(u, û) d(u, û)     (8.6.32)

where 𝒬 is the collection of such joint distributions. Then

    Σ_u Σ_û Q(u, û) d(u, û) = Q(0, 1) + Q(1, 0)     (8.6.33)

since d(u, v) = 1 − δ_{uv} is the distortion measure. It follows easily (see Prob. 8.17) that

    d(p, p̂) = |p − p̂|     (8.6.34)
Let 𝓑 be any block code of length N, and let u ∈ 𝒰_N be an output sequence of length N from
source 𝒮 and û ∈ 𝒰_N be an output sequence from source 𝒮̂. Let v(û) ∈ 𝓑 satisfy

    d_N(û, v(û)) = min_{v ∈ 𝓑} d_N(û, v)

Then

    min_{v ∈ 𝓑} d_N(u, v) ≤ d_N(u, v(û))
                          ≤ d_N(u, û) + d_N(û, v(û))
                          = d_N(u, û) + min_{v ∈ 𝓑} d_N(û, v)     (8.6.35)

where the second inequality is the triangle inequality, which this error distortion measure clearly
satisfies. By symmetry we then have

    min_{v ∈ 𝓑} d_N(û, v) ≤ d_N(u, û) + min_{v ∈ 𝓑} d_N(u, v)     (8.6.36)

Now averaging (8.6.35) and (8.6.36) with respect to the joint distribution Q(u, û) which
satisfies (8.6.30), (8.6.31), (8.6.32), and (8.6.34), we obtain

    |d(𝓑|p) − d(𝓑|p̂)| ≤ d(p, p̂) = |p − p̂|     (8.6.37)

where d(𝓑|p) and d(𝓑|p̂) are the average distortions attained with code 𝓑 for sources 𝒮 and 𝒮̂,
respectively. This "mismatch" equation tells us the maximum average distortion loss we can
have when applying a code designed for one source to another source. It allows us to make a
finite approximation to the source space since, when two sources are close in source distance
d(p, p̂), a good code for one source is also good for the other. In addition, if R(D; p) and
R(D; p̂) are the rate distortion functions for the two sources, we can easily show (see Prob. 8.18)
that

    R(D + d(p, p̂); p̂) ≤ R(D; p) ≤ R(D − d(p, p̂); p̂)     (8.6.38)
Given any ε > 0, let us divide the unit interval into L equally spaced intervals of length less
than ε, which requires L > 1/ε. Let p₁, p₂, ..., p_L be the midpoints of the L intervals. By construction
|p_l − p_{l+1}| < ε, and for any p ∈ [0, 1] we have

    min_l |p − p_l| < ε     (8.6.39)

Hence for any subsource with parameter p there is a subsource with parameter in the finite set {p₁,
p₂, ..., p_L} which is within "source distance" ε. We now use subsources corresponding to these
parameters as the finite approximation to the uncountable set of subsources. Following the
results of our earlier example, we find codes 𝓑¹, 𝓑², ..., 𝓑^L satisfying (8.6.21) to (8.6.24) and
define the composite code

    𝓑_c = ⋃_{l=1}^{L} 𝓑^l     (8.6.40)

For any subsource with parameter p ∈ [0, 1] we have from (8.6.37) that the average distortion

    d(𝓑_c|p) ≤ d(𝓑_c|p*) + d(p, p*)     (8.6.41)

where p* ∈ {p₁, p₂, ..., p_L} is such that d(p, p*) = |p − p*| < ε. Then since d(𝓑_c|p*) ≤ D* + ε
[see (8.6.29)], we have

    d(𝓑_c|p) ≤ D* + ε + ε
             = D* + 2ε     (8.6.42)
where D* satisfies [see (8.6.24)]

    R = R(D*; p*)     (8.6.43)

For the source with parameter p, the smallest average distortion possible is D, where

    R = R(D; p)
      = R(D*; p*)     (8.6.44)

But from (8.6.38) we have

    R(D* + ε; p) ≤ R(D*; p*) = R(D; p) ≤ R(D* − ε; p)     (8.6.45)

and so

    D* ≤ D + ε     (8.6.46)

Thus, finally, substituting in (8.6.42), we obtain

    d(𝓑_c|p) ≤ D + 3ε     (8.6.47)

The code rate for the composite code 𝓑_c is

    R_c = R + (ln L)/N     (8.6.48)

which approaches R as N → ∞. This shows that for any fixed rate, regardless of the value of the
unknown parameter p, we can use a single code 𝓑_c to encode our binary source with unknown
parameter with an average distortion which is asymptotically equal to the minimum achievable
when the parameter is known.
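The mismatch bound (8.6.37) that drives this construction can also be checked by simulation. The sketch below is not from the text; the parameters p and p*, the block length, the rate, the number of trials, and the unoptimized random codebook are all arbitrary choices. It couples the two sources through a common uniform variable, which realizes the optimal coupling behind d(p, p*) = |p − p*| in (8.6.34), and compares the resulting difference of average distortions with the coupling distance.

```python
# A Monte Carlo sketch (not from the text) of the "mismatch" bound (8.6.37): a random
# block code B is built without reference to the source parameter, and the same code is
# applied to binary sources with parameters p and p_star.  The two sources are coupled
# through a common uniform variable so that Pr{u_n != uhat_n} = |p - p_star|, the optimal
# coupling behind d(p, p_star) = |p - p_star| in (8.6.34).  All parameters (p, p_star, N,
# rate, number of trials, Bernoulli(1/2) codewords) are arbitrary choices.
import numpy as np

rng = np.random.default_rng(0)
p, p_star = 0.40, 0.30
N, R = 20, 0.3                              # block length and rate in nats/symbol
M = int(np.ceil(np.exp(N * R)))             # number of codewords
trials = 2000

B = rng.integers(0, 2, size=(M, N))         # random, unoptimized codebook

def encode_distortion(u):
    """Per-letter Hamming distortion of the best codeword in B for sequence u."""
    return np.min(np.sum(B != u, axis=1)) / N

d_p, d_pstar, coupling_dist = 0.0, 0.0, 0.0
for _ in range(trials):
    W = rng.random(N)                       # common randomness for the coupling
    u = (W < p).astype(int)                 # output of the source with parameter p
    uhat = (W < p_star).astype(int)         # output of the source with parameter p_star
    d_p += encode_distortion(u)
    d_pstar += encode_distortion(uhat)
    coupling_dist += np.mean(u != uhat)

d_p, d_pstar, coupling_dist = d_p / trials, d_pstar / trials, coupling_dist / trials
print(f"empirical |d(B|p) - d(B|p*)| = {abs(d_p - d_pstar):.4f}")
print(f"empirical coupling distance  = {coupling_dist:.4f}  (expected |p - p*| = {abs(p - p_star):.2f})")
```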
The method of this example generalizes to a large class of nonergodic station-
ary sources and distortion measures. The basic idea is to first observe that all
nonergodic stationary sources can be represented as a collection of stationary
ergodic subsources (Rohlin [1967]). By defining a distance measure (see
Prob. 8.18) on the subsource space, we can often “carve up” this space into a
finite number of subsets with each subset of subsources approximated by a single
subsource. This finite approximation then allows us to design good codes for each
of the finite representative subsources and take the union of these as the code for
the actual source. If there are L such subsources, then the rate of the composite
code is at most (In L)/N larger than the rate for each subcode. For sufficiently
large N, this additive term is negligible.
Universal coding refers to all such techniques where the performance of the
code selected without knowledge of the unknown “true” source converges to the
optimum performance possible with a code specifically designed for the known
true source. The technique of representing or approximating a source as a finite
composite of stationary ergodic subsources and forming a union code is one of
several universal coding techniques. Another closely related technique involves
using a small fraction of the rate to learn and characterize the stationary source,
and then using the rest of the rate in encoding the source outputs. Earlier, in
Sec. 8.5, we considered a stronger robust coding technique for finite alphabet
sources wherein the source outputs were classified according to a finite set of
composition classes. This technique also is independent of the source statistics and
is conceptually related to the approach in this section. In all cases the purpose is to
encode unknown or nonergodic sources which often may be characterized as
sources with unknown parameters. The main result of these two sections is that
these universal coding techniques can asymptotically do as well as when we know
the unknown parameter exactly. A secondary purpose of this section is to demon-
strate that, unlike stationary ergodic sources, there is no single function for noner-
godic stationary sources which plays the role of the rate distortion function.
8.7 BIBLIOGRAPHICAL NOTES AND REFERENCES
Sources with memory were also first treated by Shannon [1948, 1959]. The calcu-
lation of the rate distortion function for discrete-time Gaussian sources is due to
Shannon [1948], and the rate distortion function for a Gaussian random process is
due to Kolmogorov [1956]. Sakrison and Algazi [1971] extended this to Gaussian
random fields. Except for the Gaussian sources with squared-error distortions, the
evaluations of rate distortion functions are difficult, and various bounds due to
several researchers have been developed.
The robust source encoding of fixed composition sequences presented here
appears in Berger [1971], while the techniques of universal coding are due to Ziv
[1972], Davisson [1973], and Gray and Davisson [1974].
APPENDIX 8A CHERNOFF BOUNDS FOR
DISTORTION DISTRIBUTIONS
8A.1 SYMMETRIC SOURCES
For the symmetric source defined by (8.5.1), (8.5.2), and (8.5.3), we have the rate
distortion function given parametrically by (7.6.69) and (7.6.70) as

    D = Σ_{k=1}^{A} d_k e^{s d_k} / Σ_{k=1}^{A} e^{s d_k}     (7.6.69)

and

    R(D) = sD + ln A − ln Σ_{k=1}^{A} e^{s d_k}     (7.6.70)
where s ≤ 0. We now bound F_u(D) = Pr {d_N(u, v) ≤ D | u}, where v has the uniform
probability distribution P_N(v) = 1/A^N. Using E{·} for expectation with respect to
v, for any α > 0 we have the Chernoff bound

    F_u(D) = Pr {d_N(u, v) ≤ D | u}
           = Pr {Σ_{n=1}^{N} d(u_n, v_n) − ND ≤ 0 | u}
           ≤ E{exp [−α(Σ_{n=1}^{N} d(u_n, v_n) − ND)] | u}
           = e^{αND} Π_{n=1}^{N} E{e^{−α d(u_n, v_n)}}
           = e^{αND} [(1/A) Σ_{k=1}^{A} e^{−α d_k}]^N
           = exp {−N [−αD + ln A − ln Σ_{k=1}^{A} e^{−α d_k}]}     (8A.1)

By choosing α = −s ≥ 0, where s satisfies (7.6.69) and (7.6.70), we have the bound

    F_u(D) ≤ e^{−N R(D)}     (8A.2)
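The bound (8A.2) is easy to check numerically for a small symmetric source with balanced distortion, since under the uniform distribution on v each d(u_n, v_n) is uniformly distributed over the column values d_1, ..., d_A. The sketch below is not from the text; the distortion values, the slope s, and the block length are arbitrary choices. It computes the exact distribution of N d_N(u, v) by convolution and compares Pr {d_N(u, v) ≤ D | u} with e^{−NR(D)}.

```python
# A numerical sketch (not from the text) checking the Chernoff bound (8A.1)-(8A.2) for a
# small symmetric source with balanced distortion.  The per-letter distortion values d_k,
# the slope s, and the block length N are arbitrary choices.  With v uniform on the A^N
# reproduction sequences, each d(u_n, v_n) is uniform over {d_1, ..., d_A}, so the exact
# distribution of N*d_N(u, v) can be computed by convolving the per-letter pmf.
import numpy as np

d = np.array([0, 1, 1, 2])        # distortion values d_1, ..., d_A (integers, for an exact pmf)
A = len(d)
N = 50
s = -1.0                          # slope parameter, s <= 0

# Parametric point on the rate distortion curve, (7.6.69)-(7.6.70).
w = np.exp(s * d)
D = float(np.sum(d * w) / np.sum(w))
R = s * D + np.log(A) - np.log(np.sum(w))

# Exact pmf of the total distortion sum_n d(u_n, v_n) under uniform v: N-fold convolution.
per_letter = np.zeros(int(d.max()) + 1)
for dk in d:
    per_letter[dk] += 1.0 / A
pmf = np.array([1.0])
for _ in range(N):
    pmf = np.convolve(pmf, per_letter)

exact = float(np.sum(pmf[: int(np.floor(N * D)) + 1]))   # Pr{ d_N(u,v) <= D }
print(f"D = {D:.4f}   R(D) = {R:.4f} nats")
print(f"exact Pr{{d_N <= D}} = {exact:.3e}   Chernoff bound exp(-N R(D)) = {np.exp(-N * R):.3e}")
```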
To derive a lower bound to F_u(D), define, for any β ≤ 0,

    μ(β) = ln [(1/A) Σ_{k=1}^{A} e^{β d_k}]     (8A.3)

and note that

    μ′(β) = Σ_{k=1}^{A} d_k e^{β d_k} / Σ_{k=1}^{A} e^{β d_k}     (8A.4)

and

    μ″(β) = Σ_{k=1}^{A} d_k² e^{β d_k} / Σ_{k=1}^{A} e^{β d_k} − [Σ_{k=1}^{A} d_k e^{β d_k} / Σ_{k=1}^{A} e^{β d_k}]²
          = Σ_{k=1}^{A} [d_k − μ′(β)]² e^{β d_k} / Σ_{k=1}^{A} e^{β d_k}     (8A.5)
Here, since d_k ≤ d_0 for k = 1, 2, ..., A,

    0 ≤ μ″(β) ≤ d_0²     (8A.6)

For each u ∈ 𝒰, define a tilted probability on 𝒱 given by

    P_β(v|u) = e^{β d(u, v)} / Σ_{k=1}^{A} e^{β d_k}
             = (1/A) e^{β d(u, v)} / [(1/A) Σ_{k=1}^{A} e^{β d_k}]
             = P(v) e^{β d(u, v) − μ(β)}     (8A.7)

Given u ∈ 𝒰_N, the tilted distribution for v ∈ 𝒱_N becomes

    P_β(v|u) = Π_{n=1}^{N} P_β(v_n|u_n)
             = P_N(v) e^{βN d_N(u, v) − Nμ(β)}     (8A.8)

Note that for this tilted distribution

    Σ_v P_β(v|u) d_N(u, v) = μ′(β)     (8A.9)

and

    Σ_v P_β(v|u) [d_N(u, v) − μ′(β)]² = μ″(β)/N     (8A.10)

Given any ε > 0, we then have the bounds

    Pr {d_N(u, v) ≤ μ′(β) + ε | u} = Σ_{v: d_N(u, v) ≤ μ′(β)+ε} P_N(v)
        ≥ Σ_{v: |d_N(u, v) − μ′(β)| ≤ ε} P_N(v)
        = Σ_{v: |d_N(u, v) − μ′(β)| ≤ ε} P_β(v|u) e^{−βN d_N(u, v) + Nμ(β)}
        = e^{−N[βμ′(β) − μ(β)]} Σ_{v: |d_N(u, v) − μ′(β)| ≤ ε} P_β(v|u) e^{−βN[d_N(u, v) − μ′(β)]}
        ≥ e^{−N[βμ′(β) − μ(β)]} e^{βNε} Σ_{v: |d_N(u, v) − μ′(β)| ≤ ε} P_β(v|u)     (8A.11)
The Chebyshev inequality (see Prob. 1.4) gives

    Σ_{v: |d_N(u, v) − μ′(β)| ≤ ε} P_β(v|u) ≥ 1 − μ″(β)/(Nε²)
                                            ≥ 1 − d_0²/(Nε²)     (8A.12)

since d_N(u, v) has mean μ′(β) and variance μ″(β)/N over the tilted distribution.
Hence (8A.11) becomes

    Pr {d_N(u, v) ≤ μ′(β) + ε | u} ≥ [1 − d_0²/(Nε²)] e^{−N[βμ′(β) − μ(β) − βε]}     (8A.13)
When D > D_min and ε > 0 is small enough so that D − ε > D_min, we can
choose β ≤ 0 to satisfy

    D − ε = μ′(β)
          = Σ_{k=1}^{A} d_k e^{β d_k} / Σ_{k=1}^{A} e^{β d_k}     (8A.14)

Let s satisfy (7.6.69) so that μ′(s) = D. Then

    ∫_s^β (β − α) μ″(α) dα = μ(β) − μ(s) − (β − s)μ′(s)     (8A.15)

Since μ″(α) ≥ 0, we have

    μ(β) ≥ μ(s) + (β − s)μ′(s)     (8A.16)

so that subtracting βμ′(β) = βD − βε = βμ′(s) − βε from both sides gives

    μ(β) − βμ′(β) ≥ μ(s) − sμ′(s) + βε
                  = −sD − ln A + ln Σ_{k=1}^{A} e^{s d_k} + βε
                  = −R(D) + βε     (8A.17)

where we use (7.6.70). Using (8A.14) and (8A.17) in (8A.13), we get the desired
result

    F_u(D) = Pr {d_N(u, v) ≤ D | u}
           ≥ [1 − d_0²/(Nε²)] e^{−N[R(D) − 2βε]}     (8A.18)
8A.2 BINARY ALPHABET COMPOSITION CLASS 𝒞_N(l)

We have a source alphabet 𝒰 = {0, 1}, a representation alphabet 𝒱 = {0, 1}, and
error distortion measure d(k, j) = 1 − δ_{kj} for k, j = 0, 1. For fixed integers N and
l ≤ N, define as in (8.5.23), (8.5.24), and (8.5.25)

    𝒞_N(l) = {u: u ∈ 𝒰_N, w(u) = l}

    Q^{(l)}(1) = l/N     Q^{(l)}(0) = 1 − l/N

    R(D; Q^{(l)}) = ℋ(l/N) − ℋ(D)     0 ≤ D ≤ min {l/N, 1 − l/N}

Now pick any 0 < δ < ln 2, a fixed rate R such that δ < R < ln 2, and choose
0 < ε < 0.3 to satisfy (8.5.27). Assume l is such that there exists a D_l > ε where,
from (8.5.35),

    R = R(D_l; Q^{(l)}) + δ

We now find bounds as in (8.5.36) and (8.5.37) for Pr {d_N(u, v) ≤ D_l | u ∈ 𝒞_N(l)},
where v ∈ 𝒱_N has probability distribution

    P_N(v) = Π_{n=1}^{N} P^{(l)}(v_n)

where

    P^{(l)}(v) = (1 − l/N) P(v|0) + (l/N) P(v|1)

and P(v|u) is the conditional probability distribution yielding the rate distortion
function, R(D_l; Q^{(l)}) = I(P).
Using E{·} for expectation with respect to v, for any s ≤ 0 we have the
Chernoff bound

    Pr {d_N(u, v) ≤ D_l | u ∈ 𝒞_N(l)} ≤ E{e^{s[Σ_{n=1}^{N} d(u_n, v_n) − ND_l]} | u ∈ 𝒞_N(l)}
        = e^{−NsD_l} E{e^{s Σ_{n=1}^{N} d(u_n, v_n)} | u ∈ 𝒞_N(l)}
        = e^{−NsD_l} Π_{n=1}^{N} E{e^{s d(u_n, v_n)}}
        = e^{−NsD_l} [P^{(l)}(0)e^s + P^{(l)}(1)]^l [P^{(l)}(0) + P^{(l)}(1)e^s]^{N−l}
        = exp {−N [sD_l − (l/N) ln [P^{(l)}(0)e^s + P^{(l)}(1)] − (1 − l/N) ln [P^{(l)}(0) + P^{(l)}(1)e^s]]}     (8A.19)
We choose s to satisfy the parametric equations for D = D_l and R(D_l; Q^{(l)}). From
(7.6.58), we have

    (1 − l/N)(1 + e^s) = P^{(l)}(0) + P^{(l)}(1)e^s
    (l/N)(1 + e^s) = P^{(l)}(0)e^s + P^{(l)}(1)     (8A.20)

and

    D_l = e^s/(1 + e^s)     (8A.21)

Hence

    sD_l − (l/N) ln [P^{(l)}(0)e^s + P^{(l)}(1)] − (1 − l/N) ln [P^{(l)}(0) + P^{(l)}(1)e^s]
        = sD_l − (l/N) ln [(l/N)(1 + e^s)] − (1 − l/N) ln [(1 − l/N)(1 + e^s)]
        = ℋ(l/N) + sD_l − ln (1 + e^s)
        = ℋ(l/N) + D_l ln D_l + (1 − D_l) ln (1 − D_l)
        = ℋ(l/N) − ℋ(D_l)
        = R(D_l; Q^{(l)})     (8A.22)

and

    Pr {d_N(u, v) ≤ D_l | u ∈ 𝒞_N(l)} ≤ e^{−N R(D_l; Q^{(l)})}     (8A.23)
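A parallel numerical check of (8A.23) is sketched below; it is not from the text, and N, l, and D_l are arbitrary choices. With u containing l ones and each v_n drawn independently from the output distribution P^{(l)} of the optimal test channel, N d_N(u, v) is a sum of two binomial variables, so the exact probability can again be computed by convolution and compared with e^{−N R(D_l; Q^{(l)})}.

```python
# A numerical sketch (not from the text) checking the composition-class Chernoff bound
# (8A.23) for the binary case: with u containing l ones and v_n drawn i.i.d. from the
# standard test-channel output distribution P^(l) for a Bernoulli(l/N) source at
# distortion D_l, the exact value of Pr{d_N(u,v) <= D_l} is compared with
# exp(-N R(D_l; Q^(l))), where R(D_l; Q^(l)) = H(l/N) - H(D_l).
# N, l, and D_l below are arbitrary choices (with D_l < l/N <= 1/2).
import numpy as np

def H(p):
    return 0.0 if p in (0.0, 1.0) else float(-p*np.log(p) - (1-p)*np.log(1-p))

N, l = 60, 24                 # block length and number of ones, q = l/N = 0.4
q = l / N
D = 0.15                      # target distortion level D_l, with D_l < q
P1 = (q - D) / (1 - 2*D)      # standard output distribution P^(l)(1) at distortion D
P0 = 1.0 - P1

# pmf of N*d_N(u, v) = Binomial(l, P0) + Binomial(N-l, P1): errors occur where v_n
# disagrees with u_n, i.e. v_n = 0 at the l "1" positions and v_n = 1 elsewhere.
pmf = np.array([1.0])
for _ in range(l):
    pmf = np.convolve(pmf, [1 - P0, P0])
for _ in range(N - l):
    pmf = np.convolve(pmf, [1 - P1, P1])

exact = float(np.sum(pmf[: int(np.floor(N * D)) + 1]))
R = H(q) - H(D)
print(f"R(D_l; Q^(l)) = {R:.4f} nats")
print(f"exact Pr{{d_N <= D_l}} = {exact:.3e}   bound exp(-N R) = {np.exp(-N * R):.3e}")
```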
To derive a lower bound, we first define, for any β ≤ 0 and u ∈ 𝒞_N(l),

    μ(β|u) = (1/N) ln [E{e^{βN d_N(u, v)} | u ∈ 𝒞_N(l)}]
           = (1/N) ln [E{e^{β Σ_{n=1}^{N} d(u_n, v_n)} | u ∈ 𝒞_N(l)}]
           = (l/N) ln [P^{(l)}(0)e^β + P^{(l)}(1)] + (1 − l/N) ln [P^{(l)}(0) + P^{(l)}(1)e^β]     (8A.24)

Derivatives with respect to β are

    μ′(β|u) = (l/N) P^{(l)}(0)e^β/[P^{(l)}(0)e^β + P^{(l)}(1)] + (1 − l/N) P^{(l)}(1)e^β/[P^{(l)}(0) + P^{(l)}(1)e^β]     (8A.25)

and

    μ″(β|u) = (l/N) {P^{(l)}(0)e^β/[P^{(l)}(0)e^β + P^{(l)}(1)] − [P^{(l)}(0)e^β/(P^{(l)}(0)e^β + P^{(l)}(1))]²}
            + (1 − l/N) {P^{(l)}(1)e^β/[P^{(l)}(0) + P^{(l)}(1)e^β] − [P^{(l)}(1)e^β/(P^{(l)}(0) + P^{(l)}(1)e^β)]²}     (8A.26)
Here we have

    0 ≤ μ″(β|u) ≤ 1     (8A.27)

For a given u ∈ 𝒞_N(l), define a tilted probability on 𝒱_N given by

    P_β(v|u) = P_N(v) e^{βN d_N(u, v)} / Σ_{v′} P_N(v′) e^{βN d_N(u, v′)}
             = P_N(v) e^{βN d_N(u, v) − Nμ(β|u)}     (8A.28)

For this tilted distribution, we have

    Σ_v P_β(v|u) d_N(u, v) = μ′(β|u)     (8A.29)

and

    Σ_v P_β(v|u) [d_N(u, v) − μ′(β|u)]² = μ″(β|u)/N     (8A.30)

Now, following the same inequalities as in (8A.11), (8A.12), and (8A.13), we
have

    Pr {d_N(u, v) ≤ μ′(β|u) + ε/2 | u ∈ 𝒞_N(l)} ≥ [1 − 4/(Nε²)] e^{−N[βμ′(β|u) − μ(β|u)] + βNε/2}     (8A.31)
Since D_l > ε > 0, we have D_l − ε/2 > ε/2 > 0, and we can choose β ≤ 0 to satisfy the
parametric equations for the distortion level D_l − ε/2 and R(D_l − ε/2; Q^{(l)}). Hence from (7.6.58) we
have

    (1 − l/N)(1 + e^β) = P^{(l)}(0) + P^{(l)}(1)e^β
    (l/N)(1 + e^β) = P^{(l)}(0)e^β + P^{(l)}(1)     (8A.32)

and

    μ′(β|u) = D_l − ε/2 = e^β/(1 + e^β)     (8A.33)

giving us the relationship

    e^β = (D_l − ε/2)/[1 − (D_l − ε/2)]     (8A.34)
and

    βμ′(β|u) − μ(β|u) = ℋ(l/N) − ℋ(D_l − ε/2)
                      = R(D_l − ε/2; Q^{(l)})     (8A.35)

Choosing the parameter s to satisfy μ′(s|u) = D_l, we then have, as in (8A.17),

    μ(β|u) − βμ′(β|u) ≥ μ(s|u) − sμ′(s|u) + βε/2
                      = −R(D_l; Q^{(l)}) + βε/2     (8A.36)
Using (8A.35) and (8A.36) in (8A.31) results in the lower bound

    Pr {d_N(u, v) ≤ D_l | u ∈ 𝒞_N(l)} ≥ [1 − 4/(Nε²)] e^{−N[R(D_l; Q^{(l)}) − βε]}     (8A.37)

From (8A.34) we have

    β = ln {(D_l − ε/2)/[1 − (D_l − ε/2)]}
      ≥ ln (D_l − ε/2)
      ≥ ln (ε/2)     (8A.38)

since D_l − ε/2 > ε/2 > 0. Hence (8A.37) becomes

    Pr {d_N(u, v) ≤ D_l | u ∈ 𝒞_N(l)} ≥ [1 − 4/(Nε²)] e^{−N[R(D_l; Q^{(l)}) − ε ln (ε/2)]}     (8A.39)
PROBLEMS
8.1 Consider L independent memoryless discrete-time Gaussian sources in the multiple-source—user
system of Fig. 8.1. Let σ₁², σ₂², ..., σ_L² be the output variances of the sources, and for some positive
weights w₁, w₂, ..., w_L define the sum distortion measure

    d(u, v) = Σ_{l=1}^{L} w_l (u^{(l)} − v^{(l)})²

where u^{(l)} is the lth source output symbol. Find a parametric form for the rate distortion function in
terms of the variances and weights.
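A sketch of the expected parametric (water-filling) form is given below; it is not from the text. Scaling the lth component by √w_l reduces the weighted squared-error criterion to ordinary squared error on independent Gaussian components of variance w_l σ_l², so the curve is traced out by a single water-level parameter θ. The variances, weights, and θ values are arbitrary choices.

```python
# A small sketch (not from the text) of the parametric (water-filling) form expected in
# Prob. 8.1: scaling the l-th component by sqrt(w_l) reduces the weighted squared-error
# distortion to ordinary squared error on independent Gaussians of variance w_l*sigma_l^2,
# so the rate distortion function is traced out by a water-level parameter theta.
# Variances, weights, and theta values below are arbitrary choices.
import numpy as np

sigma2 = np.array([1.0, 2.0, 0.5])       # source variances sigma_l^2
w = np.array([1.0, 0.25, 4.0])           # positive weights w_l
eff = w * sigma2                         # effective variances w_l * sigma_l^2

def rate_distortion_point(theta):
    """Return (D, R) in nats for water level theta > 0."""
    D = float(np.sum(np.minimum(theta, eff)))
    R = float(np.sum(0.5 * np.log(np.maximum(eff / theta, 1.0))))
    return D, R

for theta in (0.05, 0.2, 0.5, 1.0, 4.0):
    D, R = rate_distortion_point(theta)
    print(f"theta = {theta:5.2f}   D = {D:6.3f}   R = {R:6.3f} nats/vector")
```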
8.2 (a) In (8.2.66), for D ≤ min {λ₁, λ₂, ..., λ_L}, show that

    R_L(D) = (1/2L) ln (|Φ|/D^L)

where Φ is the covariance matrix defined in (8.2.53) and λ₁, λ₂, ..., λ_L are its eigenvalues.
(b) In (8.2.66), let θ = max {λ₁, λ₂, ..., λ_L} and show that

    D_θ = (1/L) Σ_{l=1}^{L} λ_l

and

    R_L(D_θ) = 0
8.3 Verify Eq. (8.2.72) by following the proof of the source coding theorem in Sec. 7.2.
8.4 Consider a discrete-time first-order Gaussian Markov source with

    φ(k) = E{u_n u_{n+k}} = σ² ρ^{|k|}     k = 0, ±1, ±2, ...,   0 < ρ < 1

For the squared-error distortion show that

    R(D) = ½ ln [σ²(1 − ρ²)/D]     for 0 < D ≤ σ²(1 − ρ)/(1 + ρ)
(For large D, see Berger [1971], example 4.5.2.2.)
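For a numerical check (not from the text; σ², ρ, and the water levels below are arbitrary choices), the sketch below evaluates the water-filling integrals for the spectral density Φ(ω) = σ²(1 − ρ²)/(1 − 2ρ cos ω + ρ²) of this source and compares the result with the closed form ½ ln [σ²(1 − ρ²)/D] in the region D ≤ σ²(1 − ρ)/(1 + ρ).

```python
# A numerical sketch (not from the text) checking the small-D closed form asked for in
# Prob. 8.4 against the water-filling characterization of R(D) for a stationary Gaussian
# source with spectral density Phi(w) = sigma^2 (1 - rho^2) / (1 - 2 rho cos w + rho^2).
# sigma^2, rho, and the water levels theta are arbitrary choices; integrals are numerical
# averages over a uniform grid on [0, pi] (the spectrum is even in w).
import numpy as np

sigma2, rho = 1.0, 0.8
w = np.linspace(0.0, np.pi, 20001)
Phi = sigma2 * (1 - rho**2) / (1 - 2 * rho * np.cos(w) + rho**2)
Dc = sigma2 * (1 - rho) / (1 + rho)          # critical distortion sigma^2 (1-rho)/(1+rho)

def waterfill(theta):
    """(D, R) from the water-filling integrals, approximated by grid averages."""
    D = float(np.mean(np.minimum(theta, Phi)))
    R = float(np.mean(np.maximum(0.0, 0.5 * np.log(Phi / theta))))
    return D, R

print(f"critical distortion sigma^2(1-rho)/(1+rho) = {Dc:.4f}")
for theta in (0.02, 0.05, 0.1, Dc):
    D, R = waterfill(theta)
    closed = 0.5 * np.log(sigma2 * (1 - rho**2) / D)
    print(f"theta = {theta:.3f}   D = {D:.4f}   R = {R:.4f}   0.5*ln(sigma^2(1-rho^2)/D) = {closed:.4f}")
```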
8.5 For any discrete-time zero-mean stationary ergodic source with spectral density Φ(ω) and the
squared-error distortion measure, show that the rate distortion function is bounded by

    R(D) ≤ ½ ln (σ²/D)

where

    σ² = (1/2π) ∫_{−π}^{π} Φ(ω) dω

Generalize this for continuous-time stationary sources where

    Φ(ω) = 0     for |ω| > Ω
8.6 Generalize the lower bound given in Theorem 8.3.2 to
    R(D) ≥ R_L(D) + h − h(𝒰_L)     for any integer L
8.7 Prove the generalized Shannon lower bound given by (8.3.18).
8.8 For a stationary discrete-time Gaussian source with spectral density function Φ(ω), show that the
differential entropy rate is

    h = ½ ln (2πeE)

where

    E = exp {(1/2π) ∫_{−π}^{π} ln Φ(ω) dω}
8.9 Suppose we have a continuous-time Gaussian Markov source with spectral density

    Φ(ω) = A/[1 + (ω/ω₀)²]

where A is a normalizing constant that satisfies

    σ² = (1/2π) ∫_{−∞}^{∞} Φ(ω) dω

For this case show that for the squared-error distortion measure

    R(D) = (ω₀/π)(β − tan⁻¹ β)

where β satisfies

    D = σ² [1 + (2/π)(β/(1 + β²) − tan⁻¹ β)]

Here 1 + β² = A/θ = 2σ²/(ω₀ θ), where θ is the usual parameter (Berger [1971]).
8.10 For the continuous-time Gaussian process with the squared-error distortion measure discussed in
Sec. 8.4.3, derive the source coding theorem in the same manner as shown in Sec. 8.2 for the discrete-
time Gaussian process. That is, derive a bound similar to (8.2.88).
8.11 Verify that (8.5.15) follows from (8.5.7).
8.12 Show that

    ℋ(ε) = −ε ln ε − (1 − ε) ln (1 − ε) ≥ −ε ln (ε/2)

for ε ≤ 0.3.
8.13 Prove (8.5.50) by using the converse source coding theorem (Theorem 7.2.3) and the law of large
numbers in
    Σ_{l=0}^{N} (N choose l) q^l (1 − q)^{N−l} D_l
        = Σ_{|l−Nq| ≤ Nγ} (N choose l) q^l (1 − q)^{N−l} D_l + Σ_{|l−Nq| > Nγ} (N choose l) q^l (1 − q)^{N−l} D_l
        ≤ D_{N(q+γ)} + Σ_{|l−Nq| > Nγ} (N choose l) q^l (1 − q)^{N−l}

for any γ > 0.
8.14 (Generalization of Sec. 8.5.2) Consider source alphabet 𝒰 = {a₁, a₂, ..., a_A}, representation
alphabet 𝒱 = {b₁, b₂, ..., b_B}, and distortion {d(u, v): u ∈ 𝒰, v ∈ 𝒱} such that D_min = 0. Next define,
for any u ∈ 𝒰_N, the numbers

    n(a_k|u) = number of places where u_n = a_k     k = 1, 2, ..., A

define the composition vector

    C(u) = (n(a₁|u), n(a₂|u), ..., n(a_A|u))

and define the composition classes

    𝒞_N(l) = {u: C(u) = C_l}     l = 1, 2, ..., L_N

where L_N is the number of distinct compositions of output sequences of length N. For the lth composition
C_l = (m₁, m₂, ..., m_A), define the probability

    Q^{(l)}(a_k) = m_k/N     k = 1, 2, ..., A

and the rate distortion function R(D; Q^{(l)}), which is the rate distortion function of a memoryless source
with output probability distribution Q^{(l)}.
(a) Show that L_N ≤ (N + 1)^{A−1}.
(b) Pick δ > 0, ε > 0, and rate R such that

    δ < R < max_Q R(ε; Q)

and let Q* satisfy

    R = R(ε; Q*) + δ

For a fixed composition class 𝒞_N(l) where R(ε; Q*) ≤ R(ε; Q^{(l)}), define D_l > ε such that

    R = R(D_l; Q^{(l)}) + δ

Generalize Lemma 8.5.2 and Theorem 8.5.2 for this case.
(c) Construct composite codes and show that, if the source is memoryless, these composite codes
can approach the rate distortion limit.
8.15 For a memoryless source with source alphabet 𝒰, probability {Q(u): u ∈ 𝒰}, representation
alphabet 𝒱, and distortion measure {d(u, v): u ∈ 𝒰, v ∈ 𝒱}, let R(D) be the rate distortion function
and define

    P_N(R, D) = min_𝓑 Pr {d(u|𝓑) > D | 𝓑}

where the minimization is over all codes 𝓑 of block length N and rate R > R(D). Define the exponent

    F(R, D) = − lim_{N→∞} (1/N) ln P_N(R, D)
(a) Let Q’ be any other source probability distribution and R(D; Q’) the corresponding rate
distortion function. Show that
    F(R, D) = ∞     if R > max_{Q′} R(D; Q′)

Note that for symmetric sources with balanced distortions

    R(D) = max_{Q′} R(D; Q′)

and thus F(R, D) = ∞ for all R > R(D).
Hint: Prob. 8.14 shows that, if R > max_{Q′} R(D; Q′), then all sequences can be encoded with
distortion less than or equal to D for large enough N.
(b) Consider the composition classes defined in Prob. 8.14. Stirling's formula gives

    Pr {u ∈ 𝒞_N(l)} = Σ_{u ∈ 𝒞_N(l)} Q(a₁)^{n(a₁|u)} Q(a₂)^{n(a₂|u)} ··· Q(a_A)^{n(a_A|u)}
                    = [N!/(n(a₁|u)! n(a₂|u)! ··· n(a_A|u)!)] Q(a₁)^{n(a₁|u)} Q(a₂)^{n(a₂|u)} ··· Q(a_A)^{n(a_A|u)}
                    = e^{−N[J(Q^{(l)}, Q) + α(N)]}

where α(N) → 0 as N → ∞ and

    J(Q^{(l)}, Q) = Σ_u Q^{(l)}(u) ln [Q^{(l)}(u)/Q(u)]

Use the results of Prob. 8.14 to show that, for

    R(D) < R < max_{Q′} R(D; Q′)

we have

    P_N(R, D) ≤ Pr {u ∈ 𝒞_N(l): R(D) < R(D; Q^{(l)}), l = 1, 2, ..., L_N}
             ≤ Σ_{l: R(D) < R(D; Q^{(l)})} Pr {u ∈ 𝒞_N(l)}
             ≤ Σ_{l: R(D) < R(D; Q^{(l)})} e^{−N[J(Q^{(l)}, Q) + α(N)]}

Then show that, for any δ > 0,

    F(R, D) ≥ min_{Q̂} J(Q̂, Q)

where Q̂ satisfies R − δ ≤ R(D; Q̂).
(c) For R(D) < R < max_{Q′} R(D; Q′), let Q̂ be any probability distribution such that

    R < R(D; Q̂)

Use the converse source coding theorem (Theorem 7.2.3) to show that there exists an a > 0 (independent
of N) such that any code 𝓑 of rate R satisfies

    Pr {d(u|𝓑) > D | 𝓑} ≥ a

Here Pr {·} is the probability using distribution Q̂.
(d) Next show that, for any γ > 0 and any code 𝓑 of block length N and rate R such that
R(D) < R < R(D; Q̂), we have

    Pr {d(u|𝓑) > D | 𝓑} ≥ [a − λ²/(Nγ²)] e^{−N[J(Q̂, Q) + γ]}

where

    λ² = Σ_u Q̂(u) ln² [Q̂(u)/Q(u)]

Hint: Define the region

    G = {u: |(1/N) ln [Q̂_N(u)/Q_N(u)] − J(Q̂, Q)| ≤ γ}

and obtain the lower bound to Pr {d(u|𝓑) > D | 𝓑} by restricting the summation to this subset of
outputs. Then lower-bound Q_N(u) by Q̂_N(u) exp {−N[J(Q̂, Q) + γ]} and use both (c) above and the
Chebyshev inequality.
(e) Combine the above upper and lower bounds to P_N(R, D) to show that

    F(R, D) = min_{Q̂} J(Q̂, Q)

where Q̂ satisfies R ≤ R(D; Q̂) and where

    R(D) < R < max_{Q′} R(D; Q′)
8.16 In (8.5.68) we showed that, for the binary symmetric source with error distortion, linear codes can
achieve the rate distortion limit. Consider using the linear (7, 4) Hamming code for encoding source
sequences of block length N = 7. This code has rate R = (4/7) In 2 nats per source symbol. Find the
average distortion using this code and compare it with the rate distortion limit.
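A direct computation is sketched below; it is not from the text, and the generator matrix used is one standard choice of the (7, 4) Hamming code. It enumerates all 2⁷ source blocks, encodes each to the nearest of the 16 codewords, and compares the resulting average error distortion with the distortion D* obtained from R = ℋ(½) − ℋ(D*) at R = (4/7) ln 2.

```python
# A short numerical sketch (not from the text) for Prob. 8.16: use the 16 codewords of a
# (7,4) Hamming code as reproduction words for 7-bit source blocks from the binary
# symmetric source, compute the resulting average per-letter error distortion, and compare
# it with the distortion D* given by the rate distortion limit R = H(1/2) - H(D*) at
# R = (4/7) ln 2.  The generator matrix below is one standard choice of the Hamming code.
import itertools
import math

P = [(1, 1, 0), (1, 0, 1), (0, 1, 1), (1, 1, 1)]          # parity part of G = [I_4 | P]
codewords = []
for m in itertools.product((0, 1), repeat=4):
    parity = tuple(sum(m[i] * P[i][j] for i in range(4)) % 2 for j in range(3))
    codewords.append(m + parity)

# Every 7-bit source block is equally likely under the binary symmetric source.
total = 0
for u in itertools.product((0, 1), repeat=7):
    total += min(sum(a != b for a, b in zip(u, c)) for c in codewords)
avg_distortion = total / (2**7 * 7)

# Rate distortion limit: solve H(D) = ln 2 - R with R = (4/7) ln 2, i.e. H(D) = (3/7) ln 2.
def H(p):
    return 0.0 if p in (0.0, 1.0) else -p * math.log(p) - (1 - p) * math.log(1 - p)

target, lo, hi = (3.0 / 7.0) * math.log(2), 1e-9, 0.5
for _ in range(100):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if H(mid) < target else (lo, mid)
D_star = 0.5 * (lo + hi)

print(f"(7,4) Hamming code average distortion = {avg_distortion:.4f}")   # 1/8 by perfectness
print(f"rate distortion limit D* at R = (4/7) ln 2: {D_star:.4f}")
```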
8.17 Show that, for the set of joint distributions Q(u, û) on {0, 1} × {0, 1} where

    p = Q(1, 0) + Q(1, 1)
    p̂ = Q(1, 1) + Q(0, 1)

for given p and p̂, d(p, p̂) as defined in (8.6.32) becomes

    d(p, p̂) = |p − p̂|
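A tiny numerical sketch of this result is given below; it is not from the text, and the (p, p̂) pairs are arbitrary choices. Every coupling of Bernoulli(p) and Bernoulli(p̂) marginals is parametrized by t = Q(1, 1), and minimizing the expected error distortion Q(0, 1) + Q(1, 0) = p + p̂ − 2t over the feasible range of t recovers |p − p̂|.

```python
# A tiny numerical sketch (not from the text) of Prob. 8.17: every joint distribution
# Q(u, uhat) with Bernoulli(p) and Bernoulli(phat) marginals is parametrized by
# t = Q(1,1), with max(0, p + phat - 1) <= t <= min(p, phat), and the expected error
# distortion is Q(0,1) + Q(1,0) = p + phat - 2t.  Minimizing over t on a grid recovers
# d(p, phat) = |p - phat|.  The (p, phat) pairs below are arbitrary choices.
import numpy as np

for p, phat in [(0.3, 0.7), (0.45, 0.4), (0.1, 0.95)]:
    t = np.linspace(max(0.0, p + phat - 1.0), min(p, phat), 10001)
    expected_distortion = p + phat - 2 * t          # Q(0,1) + Q(1,0) for each coupling
    print(f"p = {p:.2f}, phat = {phat:.2f}: "
          f"min over couplings = {expected_distortion.min():.4f}, |p - phat| = {abs(p - phat):.4f}")
```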
8.18 Let 𝒮 and 𝒮̂ be two memoryless sources that differ only in their source probabilities,
{Q(u): u ∈ 𝒰} and {Q̂(û): û ∈ 𝒰}. Let 𝒱 = 𝒰 be a common representation alphabet and suppose that
the distortion measure {d(u, v): u ∈ 𝒰, v ∈ 𝒱} satisfies the triangle inequality

    d(x, y) ≤ d(x, z) + d(z, y)     for all x, y, z ∈ 𝒰

and the symmetry condition

    d(u, û) = d(û, u)     for all u, û ∈ 𝒰

For all joint distributions {Q(u, û): u, û ∈ 𝒰} where

    Q(u) = Σ_û Q(u, û)     for all u ∈ 𝒰

and

    Q̂(û) = Σ_u Q(u, û)     for all û ∈ 𝒰

define the distance between the two sources as

    d(𝒮, 𝒮̂) = min Σ_u Σ_û Q(u, û) d(u, û)

Show that (8.6.37) generalizes to

    |d(𝓑|𝒮) − d(𝓑|𝒮̂)| ≤ d(𝒮, 𝒮̂)

where d(𝓑|𝒮) and d(𝓑|𝒮̂) are the average distortions for sources 𝒮 and 𝒮̂ respectively when using
the same block code 𝓑. This is the general "mismatch theorem" for memoryless sources. If R(D; Q)
and R(D; Q̂) are rate distortion functions of the two sources, show that

    R(D + d(𝒮, 𝒮̂); Q̂) ≤ R(D; Q) ≤ R(D − d(𝒮, 𝒮̂); Q̂)

This is the general form of (8.6.38).
BIBLIOGRAPHY
Abramson, N. (1963), Information Theory and Coding, McGraw-Hill, New York.
Acampora, A. S. (1976), “ Maximum-Likelihood Decoding of Binary Convolutional Codes on Band-
Limited Satellite Channels,” Conf. Rec., National Telecommunication Conference.
Anderson, J. B., and F. Jelinek (1973), “A 2-Cycle Algorithm for Source Coding with a Fidelity
Criterion,” [EEE Trans. Inform. Theor., vol. IT-19, pp. 77-91.
Arimoto, S. (1976), “Computation of Random Coding Exponent Functions,” [EEE Trans. Inform.
Theor., vol. IT-22, pp. 665-671.
Arimoto, S. (1973), “On the Converse to the Coding Theorem for Discrete Memoryless Channels,”
IEEE Trans. Inform. Theor., vol. IT-19, pp. 357-359.
Arimoto, S. (1972), “An Algorithm for Computing the Capacity of Arbitrary Discrete Memoryless
Channels,” [EEE Trans. Inform. Theor., vol. IT-18, pp. 14—20.
Arthurs, E., and H. Dym (1962), “On the Optimum Detection of Digital Signals in the Presence of
White Gaussian Noise—A Geometric Interpretation and a Study of Three Basic Data Trans-
mission Systems,” IRE Trans. Commun. Syst., vol. CS-10, pp. 336-372.
Ash, R. B. (1965), Information Theory, Interscience, New York.
Berger, T. (1971), Rate Distortion Theory, Prentice-Hall, Englewood Cliffs, N. J.
Berlekamp, E. R. (1968), Algebraic Coding Theory, McGraw-Hill, New York.
Blahut, R. E. (1974), “Hypothesis Testing and Information Theory,” [EEE Trans. Inform. Theor.,
vol. IT-20, pp. 405-417.
Blahut, R. E. (1972), “Computation of Channel Capacity and Rate-Distortion Functions,” JEEE
Trans. Inform. Theor., vol. IT-18, pp. 460-473.
Blake, I., and R. C. Mullin (1976), An Introduction to Algebraic and Combinatorial Coding Theory,
Academic, New York.
Bode, H. W., and C. E. Shannon (1950), “A Simplified Derivation of Linear Least-Squares Smoothing
and Prediction Theory,” Proc. IRE, vol. 38, pp. 417-425.
Brayer, K. (1971), “Error-Correcting Code Performance on HF, Troposcatter, and Satellite Chan-
nels,” IEEE Trans. Commun. Technol., vol. COM-19, pp. 835-848.
Bucher, E. A., and J. A. Heller (1970), “Error Probability Bounds for Systematic Convolutional
Codes,” IEEE Trans. Inform. Theor., vol. IT-16, pp. 219-224.
Bussgang, J. J. (1965), “Some Properties of Binary Convolutional Code Generators,” IEEE Trans.
Inform. Theor., vol. IT-11, pp. 90-100.
Campbell, F. W., and J. G. Robson (1968), “Application of Fourier Analysis to the Visibility of
Gratings,” J. Physiol., vol. 197, pp. 551-566.
Courant, R., and D. Hilbert (1953), Methods of Mathematical Physics, vol. 1, Wiley-Interscience, New
York.
Darlington, S. (1964), “Demodulation of Wideband, Low-Power FM Signals,” Bell Syst. Tech. J.
vol. 43, pp. 339-374.
Davisson, L. D. (1973), “Universal Noiseless Coding,” IEEE Trans. Inform. Theor., vol. IT-19,
pp. 783-795.
Elias, P. (1960), unpublished. (See Berlekamp, E. R. [1968].)
Elias, P. (1955), “Coding for Noisy Channels,” [RE Conv. Rec., pt. 4, pp. 37-46.
Fano, R. M. (1963), “A Heuristic Discussion of Probabilistic Decoding,” [EEE Trans. Inform. Theor.,
vol. IT-9, pp. 64-74.
Fano, R. M. (1961), Transmission of Information, MIT Press, Cambridge, Mass., and Wiley, New York.
Fano, R. M. (1952), “Class Notes for Transmission of Information,” Course 6.574, MIT, Cambridge,
Mass.
Feinstein, A. (1958), Foundations of Information Theory, McGraw-Hill, New York.
Feinstein, A. (1955), “Error Bounds in Noisy Channels without Memory,” IRE Trans. Inform. Theor.,
vol. IT-1, pp. 13-14.
Feinstein, A. (1954), “A New Basic Theorem of Information Theory,” JRE Trans. Inform. Theor.,
vol. PGIT-4, pp. 2-22.
Feller, W. (1957), An Introduction to Probability Theory and its Applications, vol. 1, 2d ed. Wiley,
New York.
Forney, G. D., Jr. (1974), “ Convolutional Codes II: Maximum-Likelihood Decoding” and “ Convolu-
tional Codes III: Sequential Decoding,” Inform. Contr., vol. 25, pp. 222-297.
Forney, G. D., Jr. (1973), “The Viterbi Algorithm,” Proc. IEEE, vol. 61, pp. 268-278.
Forney, G. D., Jr. (1972a), “Maximum-Likelihood Sequence Estimation of Digital Sequences in the
Presence of Intersymbol Interference,” JEEE Trans. Inform. Theor., vol. IT-18, pp. 363-378.
Forney, G. D., Jr. (1972b). “Convolutional Codes II: Maximum Likelihood Decoding,” Stanford
Electronics Labs. Tech. Rep. 7004-1.
Forney, G. D., Jr. (1970), “ Convolutional Codes I: Algebraic Structure,” [EEE Trans. Inform. Theor.,
vol. IT-16, pp. 720-738.
Forney, G. D., Jr., and E. K. Bower (1971), “A High-Speed Sequential Decoder: Prototype Design and
Test,” IEEE Trans. Commun. Tech., vol. COM-19, pp. 821-835.
Gallager, R. G. (1976), private communication.
Gallager, R. G. (1974), “Tree Encoding for Symmetric Sources with a Distortion Measure,” [EEE
Trans. Inform. Theor., vol. IT-20, pp. 65-76.
Gallager, R. G. (1968), Information Theory and Reliable Communication, Wiley, New York.
Gallager, R. G. (1965), “A Simple Derivation of the Coding Theorem and Some Applications,” [EEE
Trans. Inform. Theor., vol. IT-11, pp. 3-18.
Gantmacher, F. R. (1959), Applications of the Theory of Matrices, Interscience, New York.
Geist, J. M. (1973), “Search Properties of Some Sequential Decoding Algorithms,” [EEE Trans.
Inform. Theor., vol. IT-19, pp. 519-526.
Gilbert, E. N. (1952), “A Comparison of Signalling Alphabets,” Bell Syst. Tech. J., vol. 31,
pp. 504-522.
Gilhousen, K. S., J. A. Heller, I. M. Jacobs, and A. J. Viterbi (1971), “ Coding Study for High Data Rate
Telemetry Links,” Linkabit Corp. NASA CR-114278 Contract NAS 2-6024.
Goblick, T. J., Jr. (1962), “Coding for a Discrete Information Source with a Distortion Measure,”
Ph.D. Dissertation, MIT, Cambridge, Mass.
Goblick, T. J., Jr., and J. L. Holsinger (1967), “Analog Source Digitization: A Comparison of Theory
and Practice,” [EEE Trans. Inform. Theor., vol. IT-13, pp. 323-326.
Gray, R. M. (1975), private communication.
Gray, R. M., and L. D. Davisson (1974), “Source Coding without the Ergodic Assumption,” JEEE
Trans. Inform. Theor., vol. IT-20, pp. 502-516.
Gray, R. M., D. L. Neuhoff, and J. K. Omura (1975), “ Process Definitions of Distortion-Rate Func-
tions and Source Coding Theorems,” JEEE Trans. Inform. Theor., vol. IT-21, pp. 524-532.
Grenander, U., and G. Szego (1958), Toeplitz Forms and Their Applications, University of California
Press, Berkeley.
Hardy, G. H., J. E. Littlewood, and G. Polya (1952), Inequalities, 2d ed., Cambridge University Press,
London.
Heller, J. A. (1975), “Feedback Decoding of Convolutional Codes,” in A. J. Viterbi (ed.), Advances
in Communication Systems, vol. 4, Academic, New York, pp. 261-278.
Heller, J. A. (1968), “Short Constraint Length Convolutional Codes,” Jet Propulsion Labs. Space
Programs Summary 37-54, vol. III, pp. 171-177.
Heller, J. A.. and I. M. Jacobs (1971), “ Viterbi Decoding for Satellite and Space Communication,”
IEEE Trans. Commun. Technol., vol. COM-19, pp. 835-848.
Helstrom, C. W. (1968), Statistical Theory of Signal Detection, 2d ed., Pergamon, Oxford.
Huffman, D. A. (1952), “A Method for the Construction of Minimum Redundancy Codes,” Proc. IRE,
vol. 40, pp. 1098-1101.
Jacobs, I. M. (1974), “Practical Applications of Coding,” [EEE Trans. Inform. Theor., vol. IT-20,
pp. 305-310.
Jacobs, I. M. (1967), “Sequential Decoding for Efficient Communication from Deep Space,” [EEE
Trans. Commun. Technol., vol. COM-15, pp. 492-501.
Jacobs, I. M., and E. R. Berlekamp (1967), “A Lower Bound to the Distribution of Computation for
Sequential Decoding,” [EEE Trans. Inform. Theor., vol. IT-13, pp. 167-174.
Jelinek, F. (1969a), “A Fast Sequential Decoding Algorithm Using a Stack,” IBM J. Res. Dev., vol. 13,
pp. 675-685.
Jelinek, F. (1969b), “Tree Encoding of Memoryless Time-Discrete Sources with a Fidelity Criterion,”
IEEE Trans. Inform. Theor., vol. IT-15, pp. 584-590.
Jelinek, F. (1968a), Probabilistic Information Theory, McGraw-Hill, New York.
Jelinek, F. (1968b), “Evaluation of Expurgated Bound Exponents,” IEEE Trans. Inform. Theor.,
vol. IT-14, pp. 501-505.
Kennedy, R. S. (1969), Fading Dispersive Communication Channels, Wiley, New York.
Kohlenberg, A., and G. D. Forney, (1968), “ Convolutional Coding for Channels with Memory,” JEEE
Trans. Inform. Theor., vol. IT-14, pp. 618-626.
Kolmogorov, N. (1956), “On the Shannon Theory of Information Transmission in the Case of Contin-
uous Signals,” JRE Trans. Inform. Theor., vol. IT-2, pp. 102-108.
Kraft, L. G. (1949), “A Device for Quantizing, Grouping and Coding Amplitude Modulated Pulses,”
M.S. Thesis, MIT, Cambridge, Mass.
Kuhn, H. W., and A. W. Tucker (1951), “ Nonlinear Programming,” Proc. 2nd Berkeley Symp. Math.
Stat. Prob., University of California Press, Berkeley, pp. 481-492.
Landau, H. J., and H. O. Pollak (1962), “ Prolate Spheroidal Wave Functions, Fourier Analysis, and
Uncertainty-III,” Bell System Tech. J., vol. 41, pp. 1295-1336.
Landau, H. J., and H. O. Pollak (1961), “ Prolate Spheroidal Wave Functions, Fourier Analysis, and
Uncertainty-II,” Bell System Tech. J., vol. 40, pp. 65-84.
Lesh, J. R. (1976), “Computational Algorithms for Coding Bound Exponents,” Ph.D. Dissertation,
University of California, Los Angeles.
Lin, S. (1970), An Introduction to Error-Correcting Codes, Prentice-Hall, Englewood Cliffs, N_J.
Linkov, Yu. N. (1965), “Evaluation of ε-Entropy of Random Variables for Small ε,” Problems of
Inform. Transmission, vol. 1, pp. 12-18. (Trans. from Problemy Peredachi Informatsii, vol. 1,
pp. 18-26.)
Lloyd, S. P. (1959), “ Least Square Quantization in PCM,” unpublished Bell Telephone Lab. memo,
Murray Hill, N.J.
Lucky, R. W., J. Salz, and E. J. Weldon (1968), Principles of Data Communication, McGraw-Hill, New
York.
Mackechnie, L. K. (1973), “ Maximum-Likelihood Receivers for Channels Having Memory,” Ph.D.
Dissertation, University of Notre Dame, Indiana.
Martin, D. R. (1976), “Robust Source Coding of Finite Alphabet Sources via Composition Classes,”
Ph.D. Dissertation, University of California, Los Angeles.
Massey, J. L. (1974), “Error Bounds for Tree Codes, Trellis Codes, and Convolutional Codes with
Encoding and Decoding Procedures,” Lectures presented at the Summer School on “ Coding and
Complexity,” Centre International des Sciences Mechaniques, Udine, Italy. (Notes published by
Springer-Verlag.)
Massey, J. L. (1973), “Coding Techniques for Digital Communications,” tutorial course notes, 1973
International Conference on Communications.
Massey, J. L. (1972), “ Variable-Length Codes and the Fano Metric,” IEEE Trans. Inform. Theor.,
vol. IT-18, pp. 196-198.
Massey, J. L. (1963), Threshold Decoding, MIT Press, Cambridge, Mass.
Massey, J. L., and M. K. Sain (1968), “Inverses of Linear Sequential Circuits,” [EEE Trans. Computers,
vol. C-17, pp. 330-337.
Max, J. (1960), “ Quantizing for Minimum Distortion,” JRE Trans. Inform. Theor., vol. IT-6, pp. 7-12.
McEliece, R. J., and J. K. Omura, (1977), “An Improved Upper Bound on the Block Coding Error
Exponent for Binary-Input Discrete Memoryless Channels,” [EEE Trans. Inform. Theor.,
vol. IT-23, pp. 611-613.
McEliece, R. J., E. R. Rodemich, H. Rumsey, and L. R. Welch (1977), “New Upper Bounds on the
Rate of a Code via the Delsarte-MacWilliams Inequalities,” [EEE Trans. Inform. Theor.,
vol. IT-23, pp. 157-166.
McMillan, B. (1956), “Two Inequalities Implied by Unique Decipherability,” IRE Trans. Inform.
Theor., vol. IT-2, pp. 115-116.
McMillan, B. (1953), “The Basic Theorems of Information Theory,” Ann. Math. Stat. vol. 24,
pp. 196-219.
Morrissey, T. N., Jr. (1970), “Analysis of Decoders for Convolutional Codes by Stochastic Sequential
Machine Methods,” JEEE Trans. Inform. Theor., vol. IT-16, pp. 460-469.
Neyman, J., and E. Pearson, (1928), “On the Use and Interpretation of Certain Test Criteria for
Purposes of Statistical Inference,” Biometrika, vol. 20A, pp. 175-240, 263-294.
Odenwalder, J. P. (1970), “Optimal Decoding of Convolutional Codes,” Ph.D. Dissertation, Univer-
sity of California, Los Angeles.
Omura, J. K. (1975), “A Lower Bounding Method for Channel and Source Coding Probabilities,”
Inform. Cont., vol. 27, pp. 148-177.
Omura, J. K. (1973), “A Coding Theorem for Discrete-Time Sources,” IEEE Trans. Inform. Theor.,
vol. IT-19, pp. 490-498.
Omura, J. K. (1971), “Optimal Receiver Design for Convolutional Codes and Channels with Memory
via Control Theoretical Concepts,” Inform. Sci., vol. 3, pp. 243-266.
Omura, J. K. (1969), “On the Viterbi Decoding Algorithm,” [EEE Trans. Inform. Theor., vol. IT-15,
pp. 177-179.
Oppenheim, A. V., and R. W. Schafer (1975), Digital Signal Processing, Prentice-Hall, Englewood
Cliffs, N.J.
Peterson, W. W. (1961), Error-Correcting Codes, MIT Press, Cambridge, Mass.
Peterson, W. W., and E. J. Weldon (1972), Error-Correcting Codes, 2d ed., MIT Press, Cambridge,
Mass.
Pilc, R. (1968), “The Transmission Distortion of a Source as a Function of the Encoding Block
Length,” Bell Syst. Tech. J., vol. 47, pp. 827-885.
Pinkston, J. T. (1966), “Information Rates of Independent Sample Sources,” M.S. Thesis, MIT,
Cambridge, Mass.
Plotkin, M. (1960), “Binary Codes with Specified Minimum Distance,” IRE Trans. Inform. Theor.,
vol. IT-6, pp. 445-450, originally Res. Div. Rep. 51-20, Univ. of Penn. (1951).
Ramsey, J. L. (1970), “Realization of Optimum Interleavers,” [EEE Trans. Inform. Theor., vol. IT-16,
pp. 338-345.
Reiffen, B. (1960), “Sequential Encoding and Decoding for the Discrete Memoryless Channel,” MIT
Research Lab. of Electronics Tech. Rept. 374.
Rohlin, V. A. (1967), “ Lectures on the Entropy Theory of Measure-Preserving Transformations,” Russ.
Math. Surv., vol. 22, no. 5, pp. 1-52.
Rosenberg, W. J. (1971), “Structural Properties of Convolutional Codes,” Ph.D. Dissertation, Uni-
versity of California, Los Angeles.
Rubin, I. (1973), “Information Rates for Poisson Sequences,” [EEE Trans. Inform. Theor., vol. IT-19,
pp. 283-294.
Sakrison, D. J. (1975), “Worst Sources and Robust Codes for Difference Distortion Measure,” [EEE
Trans. Inform. Theor., vol. IT-21, pp. 301-309.
Sakrison, D. J. (1969), “An Extension of the Theorem of Kac, Murdock, and Szego to N Dimensions,”
IEEE Trans. Inform. Theor., vol. IT-15, pp. 608-610.
Sakrison, D. J., and V. R. Algazi (1971), “Comparison of Line-by-Line and Two Dimensional Encod-
ing of Random Images,” IEEE Trans. Inform. Theor., vol. IT-17, pp. 386-398.
Savage, J. E. (1966), “Sequential Decoding—the Computation Problem,” Bell Syst. Tech. J., vol. 45,
pp. 149-176.
Shannon, C. E. (1959), “Coding Theorems for a Discrete Source with a Fidelity Criterion,” JRE Nat.
Conv. Rec., pt. 4, pp. 142-163. Also in R. E. Machol (ed.), Information and Decision Processes,
McGraw-Hill, New York, 1960.
Shannon, C. E. (1948), “A Mathematical Theory of Communication,” Bell System Tech. J., vol. 27,
(pt. I), pp. 379-423 (pt. II), pp. 623-656. Reprinted in book form with postscript by W. Weaver,
Univ. of Illinois Press, Urbana, 1949.
Shannon, C. E., R. G. Gallager, and E. R. Berlekamp (1967), “Lower Bounds to Error Probability for
Coding on Discrete Memoryless Channels,” Inform. Contr., vol. 10, pt. 1, pp. 65-103, pt. II,
pp. 522-552.
Slepian, D., and H. O. Pollak (1961), “ Prolate Spheroidal Wave Functions, Fourier Analysis, and
Uncertainty-I,” Bell System Tech. J., vol. 40, pp. 43-64. (See Landau and Pollak [1961, 1962] for
Parts II and III.)
Stiglitz, I. G. (1966), “Coding for a Class of Unknown Channels,” IEEE Trans. Inform. Theor.,
vol. IT-12, pp. 189-195.
Tan, H. (1975), “ Block Coding for Stationary Gaussian Sources with Memory under a Squared-Error
Fidelity Criterion,” Inform. Contr., vol. 29, pp. 11-28.
Tan, H., and K. Yao (1975), “Evaluation of Rate Distortion Functions for a Class of Independent
Identically Distributed Sources under an Absolute Magnitude Criterion,” [EEE Trans. Inform.
Theor., vol. IT-21, pp. 59-63.
Van Lint, J. (1971), Coding Theory, Lecture Notes in Mathematics, Springer-Verlag, Berlin.
Van de Meeberg, L. (1974), “A Tightened Upper Bound on the Error Probability of Binary Convolu-
tional Codes with Viterbi Decoding,” JEEE Trans. Inform. Theor., vol. IT-20, pp. 389-391.
Van Ness, F. L., and M. A. Bouman (1965), “The Effects of Wavelength and Luminance on Visual
Modulation Transfer,” Excerpta Medica Int. Congr., ser. 125, pp. 183-192.
Van Trees, H. L. (1968), Detection, Estimation, and Modulation Theory, Part I, Wiley, New York.
Varsharmov, R. R. (1957), “Estimate of the Number of Signals in Error Correcting Codes,” Dokl.
Akad. Nauk, SSSR 117, no. 5, pp. 739-741.
Viterbi, A. J. (1971), “Convolutional Codes and Their Performance in Communication Systems,”
IEEE Trans. Commun. Tech., vol. COM-19, pp. 751-772.
Viterbi, A. J. (1967a), “Error Bounds for Convolutional Codes and an Asymptotically Optimum
Decoding Algorithm,” [EEE Trans. Inform. Theor., vol. IT-13, pp. 260-269.
Viterbi, A. J. (1967b), “Orthogonal Tree Codes for Communication in the Presence of White Gaussian
Noise,” [EEE Trans. Commun. Tech., vol. COM-15, pp. 238-242.
Viterbi, A. J. (1967c), “Performance of an M-ary Orthogonal Communication System Using
Stationary Stochastic Signals,” [EEE Trans. Inform. Theor., vol. IT-13, pp. 414-422.
Viterbi, A. J. (1966), Principles of Coherent Communication, McGraw-Hill, New York.
Viterbi, A. J., and I. M. Jacobs (1975), “Advances in Coding and Modulation for Noncoherent Chan-
nels Affected by Fading, Partial Band, and Multiple-Access Interference,” in A. J. Viterbi (ed.),
Advances in Communication Systems, vol. 4, Academic, New York, pp. 279-308.
Viterbi, A. J., and J. P. Odenwalder (1969), “ Further Results on Optimum Decoding of Convolutional
Codes,” IEEE Trans. Inform. Theor., vol. IT-15, pp. 732-734.
Viterbi, A. J., and J. K. Omura (1974), “ Trellis Encoding of Memoryless Discrete-Time Sources with a
Fidelity Criterion,” [EEE Trans. Inform. Theor., vol. IT-20, pp. 325-331.
Wolfowitz, J. (1961), Coding Theorems of Information Theory, 2d ed., Springer-Verlag and Prentice-
Hall, Englewood Cliffs, N.J.
Wolfowitz, J. (1957), “The Coding of Messages Subject to Chance Errors,” IIl. J. of Math., vol. 1,
pp. 591-606.
Wozencraft, J. M. (1957), “Sequential Decoding for Reliable Communication,” [RE Nat. Conv. Rec.,
vol. 5, pt. 2, pp. 11-25.
Wozencraft, J. M., and I. M. Jacobs (1965), Principles of Communication Engineering, Wiley, New York.
Yudkin, H. L. (1964), “Channel State Testing in Information Decoding,” Sc.D. Thesis, MIT,
Cambridge, Mass.
Zigangirov, K. Sh. (1966), “Some Sequential Decoding Procedures,” Problemy Peredachi Informatsii,
vol. 2, pp. 13-25.
Ziv, J. (1972), “Coding of Sources with Unknown Statistics,” JEEE Trans. Inform. Theor., vol. IT-18,
pp. 384-394.
INDEX
Abelian group, 85
Abramson, N., 35
Acampora, A. S., 286, 287
Additive Gaussian noise channel, 21, 46
AEP (asymptotic equipartition property), 13, 15,
523
AGC (automatic gain control), 80
Algazi, V. R., 510, 534
All-zeros path, 239, 301
Amplitude fading, 107
Amplitude modulation, 50, 76
AND gates, 361
Anderson, J. B., 423
Arimoto, S., 141, 186, 194, 207, 212, 408
Ash, R. B., 35
Associative law, 82
Asymmetric binary ‘‘Z’’ channel, 122
Asymptotic rate, 229
Augmented generating function, 242, 246
Autocorrelation, 508
Autocorrelation function of zero-mean Gaussian
random field, 507
Average:
per digit error probability, 30
per letter mutual information, 485
Average distortion, 387, 389-391, 405, 424, 475,
482, 486, 499
Average error probability, 219
Average length of codewords, 16
Average metric increment, 351
Average mutual information, 22, 24, 25, 35, 38,
134, 141, 387, 394, 426, 431
Average normalized inner product, 121
AWGN (additive white Gaussian noise), 51
AWGN channel, 51, 131, 151, 169, 180, 220, 239,
246, 369
Backward conditional probabilities, 390
Backward test channel, 407, 408
Balanced channel condition, 221
Balanced distortion, 444, 513, 516, 520
Band-limited Gaussian source, 506
Basis, 117
Bayes’ rule, 26, 30, 390
BEC (binary erasure channel), 44, 212
Berger, T., 403, 440, 442, 446, 449, 453, 464, 479,
481, 502, 503, 505, 534, 542, 543
Berlekamp, E. R., 96, 159, 165, 173, 178, 185,
194, 368, 378
Bhattacharyya bound, 63, 88, 192, 212, 244, 302
Bhattacharyya distance, 63, 88, 292, 408, 409,
460
Bias, 350, 474
Binary alphabet composition class, 538
Binary branch vector, 302
Binary entropy function, 10, 33
Binary erasure channel (BEC), 44, 212
Binary feed-forward systematic codes, 253
Binary generator matrix for convolutional code,
228
Binary hypothesis testing, 163
Binary-input channels, 150, 315
AWGN, 214, 239, 247, 248
constant energy AWGN, 239
octal output quantized, 154
Binary-input channels:
output-symmetric, 86, 132, 179, 180, 184, 278,
315, 317, 318, 341, 346
quaternary-output, 123
Binary linear codes, 189
Binary memoryless source, 10
Binary PSK signals, 79
Binary source:
error distortion, 411
with random parameter, 531
Binary symmetric channel (BSC), 21, 151, 218,
235, 246
Binary symmetric source (BSS), 10, 397, 409,
460, 545
Binary-trellis convolutional codes, 301, 311
Binomial distribution, 217
Biphase modulation, 76
Bit energy-to-noise density ratio, 69
Bit error probability, 100, 243, 245, 246, 256, 305,
312, 317, 335, 346
for AWGN channel, 254
Bits, 8, 47
Blahut, R. E., 207, 441, 454
Blake, I., 96
Block code, 50-212, 235, 390, 424
Block coding theorems for
amplitude-continuous sources, 424
Block error probability, 99, 243, 257
Block length, 314
Block orthogonal encoder, 253
Block source code, 389
Bode, H. W., 103
Bouman, M. A., 506
Bounded distortion, 469
Bounded second moment condition, 503
Bounded variance condition, 428
Bower, E. K., 377
Branch distortion measure, 413
Branch metric generator, 334
Branch metrics, 259, 260
Branch observables, 276
Branch synchronization, 261
Branch vectors, 238
Branching process extinction theorem, 461
Brayer, K., 115
BSC (binary symmetric channel), 21, 80, 212,
216, 235, 247
BSS (binary symmetric source), 10, 409
Bucher, E. A., 341
Buffer overflow, 376
Bussgang, J. J., 271, 287
Campbell, F. W., 506
Capacity for AWGN channel, 153
Cascade of channels, 26
Catastrophic codes, 250, 258, 261, 289, 376
Cauchy inequality, 196
Cayley-Hamilton theorem, 291, 294
Central limit theorem, 108
Centroid, 173
Channel capacity, 5, 35, 138, 152, 156, 207, 208,
309, 431
of discrete memoryless channel, 23
Channel encoder and decoder, 5
Channel transition distribution, 55, 79
Channels with memory, 114
Chebyshev inequality, 15, 43, 123, 162, 537, 545
Chernoff bound, 15, 43, 63, 122, 158, 159, 164,
216, 461, 515, 518
Chernoff bounds for distortion distributions, 534
Chi-square distribution, 112
Class of semiorthogonal convolutional
encoders, 298
Code-ensemble average, 475
Code-ensemble average bit error bound, 340,
377
Code generator polynomials, 250, 289
Code generator vectors, 524
Code state diagram, 239, 240
Code synchronization, 258, 261
Code trellis, 239
Code vector, 82, 129, 189
Coded channels without interference, 285
Codeword, 11, 389
Codeword length, 16
Coherence distance of field, 510
Coherent channel:
with hard M-ary decision outputs, 221
with unquantized output vectors, 221
Coherent detection, 124
Colored noise, 102, 125
Commutative law, 82
Compatible paths, 306
Complete basis, 102
Complexity:
for sequential decoding, 375
for Viterbi decoding, 374
Composite code, 522, 523, 530, 532, 544
for overall stationary source, 531
Composite source, 527, 528
Composition class, 518, 521, 543
Computational algorithm:
for capacity, 207
for rate distortion function, 454
Concave functions, 30
Connection vectors, 302
Constraint length, 229, 230, 235, 248
of trellis code, 411
Context-free distortion measure, 387
Continuous (uncountable), 5
Continuous amplitude discrete time memoryless
sources, 388, 460, 464
Continuous amplitude sources, 423, 480
Continuous amplitude stationary ergodic
sources, 485
Continuous phase frequency shift keying, 127
Continuous-time Gaussian Markov source, 542
Continuous-time Gaussian process,
squared-error distortion, 503
Continuous-time Gaussian sources, 493
Continuous-time sources, 479
Converse to coding theorem, 6, 28, 30, 34, 35,
186
Converse source coding theorem, 400, 401, 406,
410, 427, 460, 484, 543
Converse source coding-vector distortion, 478
Convex cap (M) functions, 35, 37
Convex cup (U) functions, 35, 37, 439
Convex functions, 35
Convex region, 37
Convolutional channel coding theorem, 313
Convolutional code ensemble performance, 301,
337
Convolutional coding lower bound, 320
Convolutional lower-bound exponent, 320-321
Convolutional orthogonal codes on AWGN
channel, 315
Convolutional orthogonal encoder, 254
Correlation functions, 511
Countably infinite size alphabet, 464
Covariance matrix, 490, 542
Cramer’s theorem, 464
Critical length, 323
Critical run length of errors, 342
Courant, R., 464
Cyclic codes, 96
Darlington, S., 71
Data buffer in Fano algorithm, 376
Data compression schemes, 503
Data processing system, 27
Data processing theorem, 27, 406, 460
Data rate per dimension, 132
Davisson, L. D., 529, 534
Decision rule, 55, 273
Decoder speed factor, 376
Degenerate channels, 315
Deinterleaver, 116
Destination, 4, 385
Difference distortion measure, 452
Differential entropy, 450—452, 496, 542
Differential phase shift keying, 107
Digital delay line, 228
Dirac delta function [δ(·)], 51, 272
Discrete alphabet stationary ergodic sources, 19
Discrete memoryless channel (DMC), 20, 207,
217
Discrete memoryless source (DMS), 8, 388
Discrete stationary ergodic sources, 34
Discrete-time continuous amplitude
memoryless source, 423
Discrete-time first-order Gaussian Markov
source, 542
Discrete-time stationary ergodic source, 480,
500, 502, 542
Discrete-time stationary sources with memory,
479
Discrimination functions, 219
Disjoint time-orthogonal functions, 50
Distortion, 423
Distortion matrix, 443, 514
Distortion measure, 386, 387, 413, 428, 504, 508
Distortion rate function, 388
Distribution of computation, 356
Diversity transmission, 110
DMC (discrete memoryless channel), 20, 28, 79
DMS (discrete memoryless source), 8, 28
Dual code, 94
Dummy AWGN channel, 220
Dummy BSC, 219
Dummy distribution, 164, 166, 169
Duobinary, 151, 279, 340, 341
Dynamic programming, 287
Effective length, 329
Efficient (near rate distortion limit), 523
Elias, P., 138, 184, 286
Elias upper bound, 185, 344
Encoding delay, 29
Energy:
per signal, 158
per transmitted bit, 96
Entropy of DMS, 8, 34, 35
Entropy function, 37
Entropy rate power, 497
Envelope function of unit norm, 103
Equal energy orthogonal signals, 65
Erasure channel, Q-input, (Q + 1)-output, 214
Ergodic source, 480
Ergodicity, 480
Error distortion, 442
Error event, 322
Error run lengths, 324
Error sequence, 278, 335
Error sequence generator, 334
Error signals, 277
Error state diagram, 279
Euclidean distance, 87
Exponential source, magnitude error distortion,
449
Expurgated bound, 144, 146, 157, 217, 219
Expurgated ensemble average bound, 143, 152
Expurgated exponent, 157, 409, 460
Fading channel, 114
Fano, R. M., 34, 35, 68, 116, 138, 169, 186, 194,
287, 350, 370, 496
Fano algorithm, 370—378
Fano metric, 351, 380
Feed-forward logic, 251
Feedback decoding, 262—272, 289
Feinstein, A., 35, 138
Feller, W., 461
Fidelity criterion, 385, 388
Finite field, 311
Finite state machine, 230, 298
First-order differential entropy rate of source,
496
First-order rate distortion function, 495
Fixed composition class, 544
Fixed-composition sequences, 517
Forbidden trellis output sequence, 414
Forney, G. D., Jr., 75, 115, 251, 252, 272, 287,
295, 324, 341-343, 371, 377, 378
Free distance, 240, 252, 264
Frequency-orthogonal functions, 70
Gallager, R. G., 35, 65, 96, 103, 116, 133, 134,
138, 146, 159, 164, 165, 172, 173, 178, 185,
186, 194, 202, 215, 272, 370, 372, 373, 378,
403, 423, 430, 453, 459, 463, 467, 481
Gallager bound, 65, 68, 96, 129, 306, 316
Gallager function, 133, 306, 360, 393
Gallager’s lemma, 137
Gantmacher, F. R., 338
Gaussian image sources, 494, 506
Gaussian integral function Q(-), 62
Gaussian processes, 108
Gaussian random field, 506, 534
Gaussian source, 443, 448, 453
Gaussian source rate distortion function, 506
Gaussian vector sources, 473
General mismatch theorem, 546
Generalized Chernoff bound, 520
Generalized Gilbert bound, 460
Generalized Shannon lower bound, 496, 542
Generating function sequence, 248
Generating functions, 240, 241, 244, 252, 255
Generator matrix, 83, 288
Generator polynomials, 250
Generator sequences, 251
Geometric distributions, 465
Gilbert, E. N., 185
Gilbert bound, 185, 186, 224, 321, 344, 409, 410
Gilbert-type lower bound on free distance, 344
Gilhousen, K. S., 377, 378
Goblick, T. J., Jr., 499, 500, 524
Golay code, 89, 98
Gram-Schmidt orthogonalization procedure, 47,
117
Gram-Schmidt orthonormal representation, 273,
277
Gray, R. M., 489, 526, 529, 534
Grenander, U., 492, 497
Group codes, 85
Hamming code, 93, 545
Hamming distance, 81, 236, 239, 244, 262, 409
Hamming single error correcting codes, 93,
115
Hamming weight of binary vector, 85
Hard limiter, 80
Hard quantization, 80, 155, 246
Hardy, G. H., 194
Heller, J. A., 75, 249, 259, 287, 289, 341, 378
Helstrom, C. W., 102, 108
Hilbert, D., 464
Holder inequality, 144, 196, 359, 418, 423, 429,
487
Holsinger, J. L., 499, 500
Homogeneous random field, 508
Huffman, D. A., 13, 17
Hyperplane, 57, 203
Identity vector, 84
Improved Plotkin bound, 223, 224
Inadmissible path, 307
Incorrect subset of node, 243, 354
Independent components:
maximum distortion, 479
sum distortion measure, 471
Independent events, 7, 19
Indicator function, 391, 418, 429, 516, 521
Information in an event, 7
Information sequence, 333
Information theory, 6
Initial synchronization, 328, 341
Input alphabet, 207
Instantaneously decodable code, 12
Integral equation, 507
Intensity function, 506
Interleaving, 110, 115, 116
internal, 272
Intersymbol interference (ISI), 75, 272, 285, 331,
336
Isotropic field, 509
Jacobian, 109
Jacobs, I. M., 63, 75, 112-114, 116, 249, 259,
296, 368, 370, 373, 378
Jelinek, F., 35, 150, 194, 214, 361, 371, 376, 378,
410, 422, 423, 443, 453, 460
Jelinek algorithm, 371, 373
Jensen inequality, 37, 40, 197, 426, 487
Joint source and channel coding theorem, 467
Jointly ergodic pair source, 485, 488
Jointly ergodic process, 486
K-stage shift register, 228
Karhunen-Loeve expansion, 102, 504, 505, 507,
510, 512
Kennedy, R. S., 108
Khinchine’s process, 485
Kohlenberg, A., 115, 272
Kolmogorov, N., 534
Kraft, L. G., 18
Kraft-McMillan inequality, 18
Kuhn, H. W., 141, 202
Kuhn-Tucker conditions, 202
Kuhn-Tucker theorem, 23, 188, 208
Lagrange multipliers, 203, 434, 442, 446
Landau, H. J., 74
Law of large numbers, 543
Lesh, J. R., 141, 212, 410
L’Hospital’s rule, 120, 149, 150
Likelihood functions, 55, 159
for BSC, 169
Limit theorem for Toeplitz matrices, 491
Lin, S., 96
Linear code, 82, 189, 526
Linear convolutional codes, 96
Linear feedback logic, 252
Linear intersymbol interference channels, 284
Linkov, Yu. N., 464
List decoding, 179, 215, 365, 367
Littlewood, J. E., 194
Lloyd, S. P., 499
Lloyd-Max quantizers, 499, 500
Log likelihood ratio, 161, 273
Low-rate lower bound, 321
Lower-bound exponent, 171
Lucky, R. W., 75, 271
M-level quantizer, 499
McEliece, R. J., 184
Mackechnie, L. K., 287
McMillan, B., 18, 488, 523
Magnitude error distortion measures, 423, 427
Majority logic, 270
Mapping function, 311
Martin, D. R., 517, 523
Massey, J. L., 250, 251, 270, 287, 350, 380, 381
Matched filter, 275
Matched source and channel, 460
Max, J., 499
Maximum distortion measure, 474
Maximum likelihood, 411
Maximum likelihood decision rule, 58
Maximum likelihood decoder, 58, 227, 262
for convolutional code, 235
Maximum likelihood list-of-L decoder, 366
Maximum likelihood trellis decoding algorithm,
239, 411
Memoryless channel, 54, 79, 132, 159
Memoryless condition, 21, 146
Memoryless discrete-input additive Gaussian
noise channel, 21
Memoryless source, 8, 388, 469
Metric, 58, 236, 238, 262, 350
MFSK (m frequency orthogonal signal), 220
Minimax approach, 479
Minimum distance, 244
Minimum distortion path, 416
Minimum distortion rule, 389
Minimum-probability-of-error decision rule, 380
Minkowski inequality, 198
Mismatch equation, 532
Modulo-2 addition for binary symbols, 82
Morrissey, T. N., Jr., 264
MSK (minimum shift keying), 126
Mullin, R. C., 96
Multiple-amplitude modulation, 102
Multiple-phase modulation, 102
Mutual information, 19
Nats, 8
Natural rate distortion function, 409, 460
Neyman, J., 159
Neyman-Pearson lemma, 158-160, 172
Node errors, 243, 255, 301, 305, 362
Noiseless channel, 4, 143, 147
Noiseless source coding theorem, 6, 11-13, 19
Noisy channel, 5
Nonbinary and asymmetric binary channels, 302
Nonbinary modulation, 102
Noncatastrophic codes, 251
Noncoherent channel, 221
Noncoherent reception, 104
Nonergodic stationary source, 480, 523, 526
Nonsystematic codes, 377
Nonsystematic convolutional code, 252, 377
Observables, 52-57, 276
Observation space, 56
Octal output quantized AWGN channel, 214
Octal quantization, 155, 214
Odenwalder, J. P., 248, 287, 317, 341
Omura, J. K., 75, 184, 219, 287, 454
One-sided noise power spectral density, 51
One-step prediction error of Gaussian source,
497
Oppenheim, A. V., 71
Optimal code, 13
Optimum decision regions, 56, 187
Optimum decision rule for memoryless channel,
55
OR gates, 361
Orthogonal codes, 98, 255-258
on AWGN channel, 256, 257
Orthogonal convolutional codes, 253, 255, 257
Orthogonal functions, 117
Orthogonal set of equations, 269
Orthogonal signal set, 120, 169
Orthonormal basis functions, 47, 50, 504
Pair process, 485
Pair state diagram, 299
Pairwise error probability, 60, 244, 302
Parallel channels, 215
Pareto distribution, 361, 368, 371, 374, 378
Pareto exponent, 368
Parity-check codes, 85
Parity-check matrix, 91, 265
Pearson, E., 159
Perfect code, 98
Perron-Frobenius theorem, 338
Peterson, W. W., 96, 99, 272
Phase modulation, 76
Pilc, R., 403
Pinkston, J. T., 464
Plotkin, M., 175
Plotkin bound, 175, 184, 344
Poisson distribution, 427, 449, 465
Pollack, H. O., 74
Polya, G., 194
Positive left eigenvector, 340
Predetection filters, 102
Prefix, 12, 181
Prior probabilities, 55
Push down stack, 371
Q(·), 62, 247
Quadrature modulator-demodulators, 71
Quadriphase modulation, 76, 122
Quantization of discrete time memoryless
sources, 498
Quantized demodulator outputs, 259
Quantizer, 4, 78
Quasi-perfect code, 98, 99
Ramsey, J. L., 116
Random-access memory, 116
Random field, 468
Random vector, 468
Rate distortion function, 5, 387, 397, 427, 431,
445, 470, 471, 479, 481, 494, 503, 504
for binary symmetric source, 442
of random field, 508
for stationary ergodic sources, 479, 485, 510
for vector source with sum distortion
measure, 489
Rayleigh distribution, 109, 112
Rayleigh fading, 109
Received energy per information bit, 69
Reduced error-state diagram, 281, 283
Register length, 229
Regular simplex, 95, 169
Reiffen, B., 286
Reliability function, 68
Reliable communication system, 30
Representation alphabet, 387, 389
Robson, J. G., 506
Robust source coding technique, 523
Rodemich, E. R., 184
Rohlin, V. A., 533
Rosenberg, W. J., 251
Rubin, I., 449
Rumsey, H., 184
Run length of errors, 324
Sain, M. K., 250, 251
Sakrison, D. J., 451, 509, 510, 534
Salz, J., 75, 271
Sampling theorem, 72
Savage, J. E., 361, 376, 378
Schafer, R. W., 71
Schwarz inequality, 142, 196
Self-information, 20
Semisequential algorithm, 371
Sequential decoding, 6, 152, 227, 262, 286,
349-379
Shannon, C. E., 4, 13, 17, 19, 35, 103, 128, 138,
159, 165, 173, 178, 185, 194, 385, 451, 481,
506, 534
Shannon lower bound, 452, 463, 464
Shannon’s channel coding theorem, 5, 133
Shannon’s mathematical theory of
communications, 6
Shannon’s noiseless coding theorem, 11, 385,
465
Shift register, 228
Signal representation, 117
Signal set, 129
Signal-to-noise parameter, 67
Signal vector, 129
Single-letter distortion measure, 387, 469, 482
Slepian, D., 74
Slepian and Wolf extension to side information,
466
Sliding block decoder, 264, 271
Soft quantizer, 80, 155
Source, 4
Source alphabet, 387
Source coding model, 387
Source coding theorem, 397, 401, 427, 460, 462,
474
Source decoder, 4, 396
Source distance, 532
Source encoder, 4, 396
Source entropy, 4, 6
Source reliability function, 397
Sources with memory, 479, 494
Spectral density, 494, 511
Spectrum shaping function, 75
Sphere-packing bound, 169-216, 219, 321
Sphere-packing bound exponent, 179, 212, 220
Square-error distortion, 423, 449, 505, 542
Squared error distortion measures, 423, 505
Stack algorithm, 351, 361, 370
Staggered (offset) QPSK (SQPSK), 126
State diagram, 231
State diagram descriptions of convolutional or
trellis codes, 231, 234, 237, 240, 277
State sequences, 335
State transition matrix of intersymbol
interference, 336
Stationarity, 489
Stationary binary source, 529
Stationary ergodic discrete-time sources, 387,
388
Stationary ergodic joint processes, 489
Stationary ergodic source, 42, 480-526
Stationary nonergodic binary source, 529
Stationary source, 480
Stieltjes integral, 41
Stiglitz, I. G., 221
Stirling’s formula, 544
Strong converse to coding theorem, 186, 408
Suboptimal metric, 259
Sufficient statistics, 54
Suffix, 181
Sum channels, 215
Sum distortion measure, 470-541
Superstates, 348
Surviving path, 236, 239
Symbol energy, 108
Symbol energy-to-noise density ratio, 155
Symbol synchronization, 261
Symbol transition probability, 131
Symmetric sources, 403, 513
with balanced distortion, 443, 462, 513, 516,
544
Synchronization of Viterbi decoder, 260
Syndrome of received vector, 91, 264
Syndrome feedback decoder, 262-272
Syndrome table-look-up procedure, 269
Systematic binary linear code, 223
Systematic code, 90, 91, 251, 264, 268, 365,
377
Systematic convolutional codes, 251, 268, 329,
331
Szego, G., 492, 497
Table-look-up technique, 91
Tail:
of code, 229, 231, 258
of trellis, 412
Tan, H., 448, 449, 453, 465, 493
Threshold-decodable convolutional codes, 270
Threshold logic, 270
Tilted probability, 161, 536, 540
Tilting variable, 161
Time-diversity, 110
Time-invariant (fixed) convolutional codes, 229
Time-orthogonal functions, 70
Time-orthogonal quadrature phase functions, 70
Time-varying convolutional codes, 229,
301-305, 331-346, 357, 361
Toeplitz distribution theorem, 491, 493, 505, 509
Toeplitz matrices, 491
Transition probabilities, 207
Transition probability matrix, 259
Transmission rate, 254
Transmission time per bit, 255
Transorthogonal code, 95
Tree-code representation, 232, 460
Tree descriptions of convolutional or trellis
codes, 234
Tree diagram, 230, 232, 236
Trellis-code representation, 233
Trellis codes, 234, 264, 401, 411
Trellis diagram, 230-240, 411, 412
Trellis source coding, 412, 414
Trellis source coding theorem, 421, 430
Triangle inequality, 545
Truncated maximum likelihood decision, 268
Truncated-memory decoder, 262
Truncation errors, 327
Tucker, A. W., 141, 202
Two-dimensional spectral density function,
508
Two-dimensional version of Toeplitz
distribution theorem, 509
Unbounded distortion measure, 427
Unconstrained bandwidth, 165, 220
Uniform error property, 86, 278
Uniform quantizers, 499
Uniform source, magnitude error distortion,
449
Union-Bhattacharyya bound, 63, 67, 244
Union bound, 61, 476
on bit error probability, 316
Uniquely decodable code, 12
Universal coding, 523, 526, 533
Unquantized AWGN channel, 156
Useless channel, 148
User, 385
User alphabet, 387, 388
VA (Viterbi algorithm), 238, 258, 261, 276, 287,
414
Van de Meeberg, L., 289
Van Lint, J., 96
Van Ness, F. L., 506
Van Trees, H. L., 102, 107
Variant:
of Holder inequality, 196
of Minkowski inequality, 199
Varshamov, R. R., 185
Varshamov-Gilbert lower bound, 185
Vector distortion measure, 469, 476
Very noisy channel, 155, 313, 326, 328
Viterbi, A. J., 107, 108, 287, 296, 313, 317, 320,
321, 341, 371, 411, 454
Viterbi decoder, 237-334, 374-378, 411-423
Weak law of large numbers, 43, 447
Weight distribution, 310
Weighted ensemble average, 132
Weighting factors, 279
Welch, L. R., 184
Weldon, E. J., 75, 96, 271, 272
"Whiten" noise, 103, 501
Whitened matched filter, 295, 298
Wolfowitz, J., 35, 138, 186, 194
Wozencraft, J. M., 63, 112-114, 116, 286, 370,
373, 378
Yao, K., 448, 449, 453, 465
Yudkin, H. L., 364, 370, 378
Z channel, 44, 159, 212, 216
Zero-rate exponent, 152, 178, 318, 321
Zeroth order Bessel function of first kind, 509
Zeroth order modified Bessel function, 105
Zigangirov, K. Sh., 371, 378
Zigangirov algorithm, 371
Ziv, J., 534